Research Article

An empirical evaluation of Lex/Yacc and ANTLR parser generation tools

Roles Conceptualization, Data curation, Formal analysis, Funding acquisition, Investigation, Methodology, Project administration, Resources, Software, Supervision, Validation, Visualization, Writing – original draft, Writing – review & editing

* E-mail: [email protected] , [email protected]

Affiliations University of Oviedo, Computer Science Department, Oviedo, Spain, Munster Technological University, Cork Institute of Technology, Computer Science Department, Rossa Avenue, Bishopstown, Cork, Ireland

Roles Data curation, Investigation, Methodology, Resources, Software, Validation, Writing – review & editing

Affiliation University of Oviedo, Computer Science Department, Oviedo, Spain

  • Francisco Ortin, 
  • Jose Quiroga, 
  • Oscar Rodriguez-Prieto, 
  • Miguel Garcia

  • Published: March 3, 2022
  • https://doi.org/10.1371/journal.pone.0264326

Abstract

Parsers are used in different software development scenarios such as compiler construction, data format processing, machine-level translation, and natural language processing. Due to the widespread usage of parsers, there exist different tools aimed at automating their generation. Two of the most common parser generation tools are the classic Lex/Yacc and ANTLR. Even though ANTLR provides more advanced features, Lex/Yacc is still the preferred choice in many university courses. There exist different qualitative comparisons of the features provided by both approaches, but no study evaluates empirical features such as language implementor productivity and tool simplicity, intuitiveness, and maintainability. In this article, we present such an empirical study by conducting an experiment with undergraduate students of a Software Engineering degree. Two random groups of students implement the same language, each using a different parser generator, and we statistically compare their performance with different measures. Within the context of the academic study conducted, ANTLR shows significant differences for most of the empirical features measured.

Citation: Ortin F, Quiroga J, Rodriguez-Prieto O, Garcia M (2022) An empirical evaluation of Lex/Yacc and ANTLR parser generation tools. PLoS ONE 17(3): e0264326. https://doi.org/10.1371/journal.pone.0264326

Editor: Sergio Consoli, European Commission, ITALY

Received: October 25, 2021; Accepted: February 8, 2022; Published: March 3, 2022

Copyright: © 2022 Ortin et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: All the raw data obtained from conducting all the experiments are available for download from https://reflection.uniovi.es/download/2022/plos-one .

Funding: This work has been partially funded by the Spanish Department of Science, Innovation, and Universities: project RTI2018-099235-B-I00. The authors have also received funds from the University of Oviedo through its support to official research groups (GR-2011-0040).

Competing interests: The authors have declared that no competing interests exist.

Introduction

Parsing—also known as syntax or syntactic analysis—is the process of analyzing a string of terminal symbols conforming to the rules of a formal grammar [ 1 ]. Such a grammar may describe a natural language (e.g., English or French), a computer programming language, or even a data format. The process of recognizing the terminal symbols, also called tokens, from a sequence of characters is called lexical analysis [ 2 ]. Lexical analyzers are commonly called lexers or scanners.

Parsers are software components that, using a lexer to recognize the terminal symbols of a given language, analyze an input, check its correct syntax, and build a tree that represents the input program with a hierarchical data structure [ 1 ]. Parsers are used for different tasks in computer science such as performing machine-level translation of programs (e.g., (de)compilers, transpilers, and (dis)assemblers), creating software development tools (e.g., profilers, debuggers, and linkers), natural language processing (e.g., dependency and constituency parsing), and data format processing (e.g., JSON, GraphViz DOT, and DNS zone files).

The importance of parsing in computer science has implied the existence of different tools aimed at the automatic generation of parsers [ 3 ]. By using these tools, the user describes the syntax of the language to be recognized, and the tool generates a parser for the specified language. There exist plenty of parser generators, such as Yacc/Bison, ANTLR, JavaCC, LISA, SableCC, and Coco/R, among many others [ 4 ]—some of them provide additional features for compiler construction beyond parsing.

Although these tools have different properties, such as the parsing strategy or the programming language of the generated code, most software developers opt for either the classic Lex/Yacc approach or the ANTLR 4 parser generator [ 5 ]. Parser generators are also used to teach programming language and compiler implementation courses in computer science degrees. Table 1 shows the parser generation tools used in the compiler construction courses of the top 10 universities, according to the following four rankings (computer science sections): Shanghai Consultancy [ 6 ], Quacquarelli Symonds [ 7 ], Times Higher Education [ 8 ], and US News [ 9 ].

Table 1. https://doi.org/10.1371/journal.pone.0264326.t001

As shown in Table 1, most courses (13 out of 16) use variants of the classic Lex/Yacc approach: Bison is the GNU implementation of Yacc; Flex is a free and open-source version of Lex; OCamlLex, OCamlYacc, and Menhir are Lex/Yacc implementations for the OCaml programming language; ML-Lex and ML-Yacc for Standard ML; and JFlex and CUP for Java. Only the Computer Language Engineering course delivered at MIT uses ANTLR, despite its advanced features (see the next section) and its widespread usage in the development of different products such as Hibernate, Twitter search, Hadoop, and the Oracle SQL Developer IDE [ 10 ].

Due to the existing differences between Lex/Yacc and ANTLR, the main contribution of this article is an empirical comparison of these two parser generation tools. To that aim, we randomly divide the students of a Programming Language Design and Implementation course in a Software Engineering degree into two groups. Each group is taught a different parser generator and asked to develop an imperative procedural programming language. We then measure different variables considering the students’ performance when using each tool, and their opinion about the parser generator used. These data provide evidence, in the context of our study, for an empirical comparison of the two parser generators.

The following section discusses the related work. Then, we present a qualitative comparison of Lex/Yacc and ANTLR. For the quantitative comparison, we describe the methods used in our study, the results, and different discussions. Finally, we present the conclusions of our work.

Related work

Daniela da Cruz et al. present a qualitative comparison of the Yacc, LISA, and ANTLR 3 parser generators [ 11 ]. To that aim, they define the Lavanda Domain-Specific Language (DSL), whose goal is to compute the number of laundry bags a company needs to send to the laundry each day. They implement Lavanda as an interpreter with each of the three tools. The Lavanda semantics is specified as an attribute grammar that is translated to Yacc, LISA, and ANTLR 3. Based on the three implementations, the authors present a qualitative evaluation where they analyze different features divided into three groups: language specification, parser generator, and generated code. Overall, they state that Yacc is the poorest tool, and that ANTLR 3 and LISA are very similar, regarding the comparison criteria used in their study.

Klint et al. perform an empirical analysis of the benefits of using tools for the implementation of DSLs [ 12 ]. In their study, they select six experts to implement Waebric, a little language for XHTML markup generation used for website creation. Each expert develops a different language implementation: three in Java, C#, and JavaScript with no tools, and the other three using ANTLR, Microsoft “M”, and OMeta. For the comparison, they measure different quantitative variables such as volume (number of modules, units, and lines of code), cyclomatic complexity, and percentage of code duplication. They also present a qualitative evaluation comparing the maintainability of each implementation. Their evaluation concludes that the maintainability of the DSL implementations is higher when generators are used.

Language workbenches are tools that lower the development costs of implementing new languages and their associated tools [ 13 ]. They include different components, such as parser generators, to make language-oriented programming environments practical [ 14 ]. Erdweg et al. present empirical data on the implementation of a tiny Questionnaire Language (QL) to create questionnaire forms using ten language workbenches (Ensō, Más, MetaEdit+, MPS, Onion, Rascal, Spoofax, SugarJ, Whole, and Xtext) [ 15 ]. First, they analyze to what extent each language workbench is able to support all the features of the QL language. Then, they measure the source lines of code and the number of model elements used. No language workbench realizes all the features. The least widely supported features are type checking, DSL program debugging, quick fixes, and live translation.

Josef Grosch performs a qualitative and quantitative comparison of lexer generators (Lex, Flex, and Rex) and parser generators (Yacc, Bison, PGS, Lalr, and Ell) [ 16 ]. He first presents a qualitative comparison of both kinds of tools by analyzing their features. Then, Grosch compares the size of the lexer/parser table and the generated program, the generation time, and the execution time of the code produced. For lexical analysis, Rex is the tool with the lowest generation times and the smallest table and lexer sizes, and it generates the fastest lexers. Among the parser generation tools, Bison provides the lowest generation times and the smallest table and parser sizes, whereas Ell generates the fastest parsers.

Hyacc is a Yacc-compatible open-source parser generator that accepts all LR(1) grammars and generates full LR(0), LALR(1), LR(1), and partial LR(k) parsers [ 17 ]. Using Hyacc, Chen and Pager perform an empirical comparison of runtime performance and memory consumption of the most common LR(1) parsing algorithms [ 18 ]. For the evaluation, they used 17 simple grammars and 13 specifications of real programming languages. The Knuth LR(1) algorithm uses more memory than the rest of the algorithms, and Bison LALR(1) is the one with the lowest memory consumption. Regarding runtime performance, Knuth LR(1) is the slowest algorithm and LR(0) runs the fastest.

Renée McCauley performs a shallow analysis of parser generation tools and how they could be used in the delivery of different courses beyond programming language design and implementation, such as operating systems, software engineering, and even human-computer interaction [ 19 ]. Parser generation tools represent a good example of principles that have endured, because the underlying theory and algorithms are well understood and used to implement different kinds of tools. McCauley indicates that ANTLR is more powerful than the traditional Lex/Yacc approach, mainly because of its lexer and parser integration, automatic parse tree generation, and its tree walker grammars.

Features of Lex/Yacc and ANTLR

Table 2 compares the main features of the two parser generators. Yacc follows a bottom-up parsing strategy, creating a node of the parse tree only when its child nodes have been created [ 2 ]. It implements an LALR(1) (Look-Ahead LR) parser, which is a simplification of general LR(1) bottom-up parsers [ 20 ]. To decide whether the parser must shift another token from the input or reduce a production of the grammar, the algorithm examines a single lookahead token.

Table 2. https://doi.org/10.1371/journal.pone.0264326.t002

In contrast, ANTLR follows a top-down parsing strategy, creating the parent nodes of the parse tree before their children. It implements an adaptive LL(*) parsing algorithm [ 21 ] that analyzes a finite but not fixed number of tokens (lookahead) [ 22 ]. During parsing, it launches different subparsers when multiple productions could be derived, exploring all possible paths. Then, the subparser with the minimum lookahead depth that uniquely predicts a production is selected [ 22 ]. This algorithm has a theoretical computational complexity of O(n⁴), where n is the number of nodes in the parse tree, but it has been experimentally measured to be linear in both time and space for real programming language implementations [ 22 ].

Yacc uses the BNF (Backus-Naur Form) notation to specify language grammars [ 23 ], while ANTLR supports an extended (EBNF-like) notation that adds operators similar to those found in common regular expressions [ 10 ]. This feature allows language implementors to describe grammars with fewer productions. Moreover, simpler grammars produce smaller parse trees, which ANTLR generates automatically (Yacc does not support this feature). ANTLR uses the same grammar notation and parsing strategy for both the lexical and the syntactic language specifications. Yacc only supports syntax specification; lexical analysis must be implemented separately, commonly with Lex, which generates Deterministic Finite Automata (DFA) [ 23 ].
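
As an illustration of the difference in notation, a comma-separated parameter list needs explicit recursion and a helper non-terminal in Yacc's BNF, whereas ANTLR's extended operators collapse it into a single rule (the rule and token names below are illustrative sketches, not taken from either tool's documentation):

    /* Yacc (BNF): explicit recursion and a helper non-terminal */
    param_list
            : /* empty */
            | params
            ;
    params  : param
            | params ',' param
            ;

    // ANTLR 4 (extended notation): one rule using optional and repetition operators
    paramList : (param (',' param)*)? ;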

One language specification in ANTLR can be used to generate parsers in different programming languages (see Table 2 ). ANTLR also provides listeners to decouple grammars from application code (semantic actions), encapsulating the behavior in a separate module instead of scattering it across the grammar. Yacc does not provide this option, so the user must write the semantic actions embedded in the grammar.
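
The contrast can be sketched as follows; the grammar name Expr, the helper function make_add_node, and the listener class AstBuilder are hypothetical names used only for illustration:

    /* Yacc: the semantic action is embedded in the production */
    expr : expr '+' term   { $$ = make_add_node($1, $3); }
         | term
         ;

    // ANTLR 4: the production stays free of actions...
    expr : expr '+' expr | INT ;

    // ...and the behavior lives in a Java listener generated for a grammar named Expr
    class AstBuilder extends ExprBaseListener {
        @Override
        public void exitExpr(ExprParser.ExprContext ctx) {
            // build the corresponding AST node from ctx here
        }
    }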

Both tools support right recursion. ANTLR 4 has also addressed one of the main limitations of its previous versions: the support of direct left recursion. Top-down parsers cannot handle left recursion directly, but ANTLR 4 rewrites directly left-recursive productions into non-left-recursive equivalents [ 10 ]. This is an important benefit, since it allows writing the classic expression productions with both left and right direct recursion.
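
For instance, ANTLR 4 accepts the classic directly left-recursive expression rule as written and rewrites it internally (a sketch; in left-recursive rules, precedence follows the order of the alternatives):

    expr : expr '*' expr      // earlier alternatives bind tighter
         | expr '+' expr
         | '(' expr ')'
         | INT
         ;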

ANTLR provides semantic predicates to resolve some grammar ambiguities. For example, f(i) in Visual Basic could mean either array indexing or a function invocation. By storing the type of f at its definition, a semantic predicate can select the appropriate production by consulting the symbol table. As shown in Table 2 , ANTLR has different plugins for existing IDEs to help the user in the language implementation process. It also provides TestRig, a testing tool that displays information about how the parser and lexer work at runtime, together with different views of the parse tree for a given input.
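
Returning to the f(i) example, such a predicate could be sketched in ANTLR 4 notation as follows; isFunction is a hypothetical helper that consults the symbol table, and the token names are illustrative:

    expr : {isFunction(_input.LT(1).getText())}? ID '(' expr ')'   // function invocation
         | ID '(' expr ')'                                         // array indexing
         | ID
         ;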

Methods

After this brief qualitative comparison of Lex/Yacc and ANTLR, we describe the methods used for the empirical comparison. We followed the guidelines by Kitchenham et al. for empirical research in software engineering [ 24 ].

We conducted an experiment with undergraduate students of a Programming Language Design and Implementation course [ 25 ] in a Software Engineering degree [ 26 ], at the University of Oviedo (Spain). The course is delivered in the second semester of the third year. The students have previously attended five programming courses, three of them in Java, and the other two in C# and Python. Such courses cover different topics such as procedural, object-oriented, functional, and concurrent programming [ 27 ], and human-computer interaction [ 26 ]. They have also taken other courses that include Java programming assignments such as software design, algorithmics, data structures, web development, databases, numeric computation, distributed systems, information repositories, computer architecture, and operating systems [ 26 ]. In four of these courses, the students have to implement a real Java application following a project-based learning approach.

The Programming Language Design and Implementation course is delivered through lectures and laboratory classes, totaling 58 class hours (30 hours for labs and 28 for lectures) over the second semester (6 ECTS credits). Lectures introduce the theoretical concepts of programming language design. Labs ( Table 3 ) follow a project-based learning approach, and they are mainly aimed at implementing a compiler for a medium-complexity imperative programming language. That language provides integer, real, and character built-in types, arrays, structs, functions, global and local variables, arithmetic, logical, and comparison expressions, literals, identifiers, and conditional and iterative control flow statements [ 25 ]. The compiler is implemented in the Java programming language.

Table 3. https://doi.org/10.1371/journal.pone.0264326.t003

The course has two parts. In the first seven weeks, the students have to implement the lexical and syntactic analyzers using a generation tool. The parser should return an Abstract Syntax Tree (AST) if no errors are detected in the input program. In lab 7, the students' work is checked for completion, and they receive detailed feedback about the mistakes they need to correct.

In the second part of the course, they traverse the AST to perform semantic analysis and code generation for a stack-based virtual machine [ 30 ]. The evaluation consists of a theory exam (50%) and a lab exam where the students must extend their language implementation (50%). Both evaluations are performed after delivering all the lectures and labs.

Our study took place in the second semester of the 2020-2021 academic year. The 183 students (131 males and 52 females) enrolled in the course were divided into two random groups: the first group had 92 students (67 males, 25 females, average age 22.8, standard deviation 2.04) and the second one had 91 (64 males, 27 females, average age 22.7, standard deviation 2.13). Retakers were not included in our study because they have previous knowledge of the course and experience using Lex/Yacc (the tool used in previous years).

For the first part of the course (the first seven weeks), lectures and labs for the first group were delivered using BYaccJ/JFlex (Java versions of the classic Yacc/Lex generators). For the second group, ANTLR was used instead. The second part of the course has the same contents for both groups, since there is no dependency on any generation tool. Both groups implemented the very same programming language, and they had the same theory and lab exams.

We analyze the influence of the Lex/Yacc and ANTLR generators on students’ performance (students are the experimental units) with different methods. First, we measure the completion level of students’ work after each lab focused on the use of lexer and parser generators (i.e., labs 4 to 6 in Table 3 ). The completion level is measured as a percentage indicating to what extent all the functional requirements of each lab have been met by the student. To that aim, we define a rubric, and the lab instructor records each student's completion percentage after the lab. We perform the same analysis when we review their work after the first part of the course (lab 7) and for the whole compiler (lab 15).

If students do not manage to finish their work in the 2 hours of each lab, they have one week to finish it before the next lab. At the beginning of each lab, we ask them to record the number of additional autonomous hours it took them to finish the work for the previous lab. We use these data to compare how long it took the students to finish the labs related to lexing and parsing.

To gather the students’ opinion about the lexer and parser generators used, we asked them to fill in the anonymous questionnaire shown in Table 4 , published online with Google Forms. They filled in the questionnaire after the labs where the generators were first used (labs 4 and 5), and after implementing the first part of the language (lab 7) and the whole compiler (lab 15). Questions 4 and 5 were only asked in lab 15, since students had not yet implemented the semantic analysis and code generation modules in labs 4, 5, and 7. We compute Cronbach’s α coefficient [ 31 ] to measure the internal consistency of our questionnaire. The results—α = 0.862 for labs 4, 5, and 7 and α = 0.867 for lab 15—show its good reliability [ 32 ]. Likewise, we use Krippendorff’s α coefficient for ordinal data to measure inter-rater reliability [ 33 ]. Substantial reliability (α > 0.8) is obtained for all the questionnaires except the lab 15 questionnaire for Lex/Yacc, which shows modest reliability (α = 0.781) [ 34 ].

Table 4. https://doi.org/10.1371/journal.pone.0264326.t004

Besides measuring the students’ performance and opinion after the labs, we measure their final performance. To that aim, we compare the pass and fail rates of both groups (Lex/Yacc and ANTLR) and the number of students who took the exams. We also measure and compare the marks of the lab exams and the final grades achieved by the students. We want to see whether there are statistically significant differences between the values of the two groups of students. Since the two groups are independent and the students’ marks are normally distributed (Shapiro-Wilk test [ 35 ], p-value > 0.1 for all the distributions), we apply an unpaired two-tailed t-test (α = 0.05) [ 36 ], where the null hypothesis (H0) states that there is no significant difference between the means of the two groups.

Ever since the creation of the Programming Language Design and Implementation course, the students had implemented their lexical and syntax analyzers with the BYaccJ and JFlex generators. 2020-2021 is the first academic year in which we introduced ANTLR. Therefore, we also analyze whether the use of a new parser generator (ANTLR) produces any significant difference with respect to students’ performance in the previous years. We conduct a statistical hypothesis test (unpaired two-tailed t-test) to see whether the use of ANTLR causes any difference relative to the Lex/Yacc approach of the previous years.

All the statistical computations were performed with IBM SPSS Statistics 27.0.1. The raw data obtained from conducting all the experiments are available for download from [ 37 ]. All the data were collected anonymously and voluntarily, and we had no access to personal data.

Results

Table 5 shows the percentage of students who were able to complete different activities related to the implementation of lexical and syntactic analyzers. We consider the work to be completed when it fulfills at least 95% of the requirements with minor mistakes. Labs 4, 5, and 6 are focused on, respectively, lexer, parser, and AST construction. We can see how, on average, the group using ANTLR presents higher completion percentages (differences range from 5.4% to 17.3%).

Table 5. https://doi.org/10.1371/journal.pone.0264326.t005

Likewise, Table 6 presents the percentage of work that students managed to complete at each of the labs about lexer and parser generation. The greatest p-value obtained for the unpaired two-tailed t-tests is 0.015. Table 6 also presents the power of the tests. The power of a hypothesis test is the probability that the test correctly rejects the null hypothesis, denoted by 1−β (β represents the probability of making a type II error) [ 38 ]. For all the tests, we obtain values above 0.8, which is the threshold commonly taken as a standard of adequacy for the given sample sizes [ 38 ]. Therefore, we reject the null hypothesis and conclude that there are significant differences between the two groups. On average, ANTLR students completed from 6% to 15.5% more work in the labs than the Lex/Yacc group.

Table 6. https://doi.org/10.1371/journal.pone.0264326.t006

Table 6 also shows the number of additional hours it took the students to complete each lab autonomously. All the p-values for the t-tests are lower than 0.05 and 1−β > 0.8, so the differences between both groups are significant. Lex/Yacc students needed, on average, from 1.7 to 3.5 more hours than the ANTLR group.

In Table 7 , we summarize the Likert-scale answers to the questionnaire in Table 4 . Likewise, Fig 1 shows the distribution of those answers and a visual comparison of the Lex/Yacc and ANTLR groups. The arithmetic mean of the ANTLR group is higher than that of Lex/Yacc for all the questions and labs. All the ANTLR median and mode values are equal to or higher than those for Lex/Yacc; 92% of the median values and 64% of the mode values are strictly higher for ANTLR.

Fig 1. https://doi.org/10.1371/journal.pone.0264326.g001

Table 7. https://doi.org/10.1371/journal.pone.0264326.t007

Fig 1 also shows how, after using the different lexer and parser generators, the students consider the simplicity (Q1), intuitiveness (Q2), and maintainability (Q3) of ANTLR to be greater than those of Lex/Yacc. For both groups, the students consider that modifying the parser is easier than modifying the semantic analyzer (Q4) and the code generator (Q5), and these differences are higher for the students who used ANTLR.

The Likert-scale answers to questions Q1-Q5 of the questionnaire in Table 4 can be combined to provide a quantitative measure [ 39 ]. In our case, the combined values represent the students’ opinion about the benefits (simplicity, intuitiveness, and maintainability) of the generator used. Fig 2 shows the sum of all the Likert values for each lab, divided by the number of questions so that the same scale (1 to 5) is used. 95% confidence intervals are shown as whiskers. We can see that, for all the labs, there are statistically significant differences between the Lex/Yacc and ANTLR groups (i.e., the confidence intervals do not overlap [ 40 ] and the p-values are lower than 0.01). The average values of ANTLR are from 10% to 37% higher than those of Lex/Yacc.

Fig 2. Whiskers represent 95% confidence intervals and the points denote their mean values.
https://doi.org/10.1371/journal.pone.0264326.g002

The second- and third-to-last columns in Table 8 detail some evaluation rates for the Programming Language Design and Implementation course in the 2020-2021 academic year, when we conducted our experiment. The students who used ANTLR not only exhibit higher pass rates but also take more exams.

Table 8. https://doi.org/10.1371/journal.pone.0264326.t008

Fig 3 shows the evolution of exam attendance and pass and fail rates over the last years (for the very same course). The last two bars are the Lex/Yacc and ANTLR groups of the academic year under study. The grey area shows the 95% confidence interval for the previous seven years, so we can see whether the values of each group fall within those intervals. We can see how the Lex/Yacc group is included in the confidence interval; i.e., no significant difference exists [ 40 ]. In contrast, the exam attendance and pass rates of the ANTLR group are higher than the 95% confidence interval, while the fail rate is below that interval.

Fig 3. The grey area delimits the 95% confidence intervals of the years before our study (from 2013-2014 to 2019-2020).
https://doi.org/10.1371/journal.pone.0264326.g003

Fig 4 presents the same comparison with the previous years, but considering the lab exams and the students’ final marks. In the last academic year, the students of the ANTLR group obtained significantly higher marks than those in the Lex/Yacc group (95% confidence intervals do not overlap) for both the lab exams and the final marks. Compared to the previous years, the Lex/Yacc group shows no difference for either kind of assessment. However, ANTLR shows significantly higher values than the rest of the years for the lab exam, and higher than all of them but two (2013-14 and 2016-17) in the case of the final marks. The average lab and final marks of the students in the ANTLR group are the highest among all the years.

Fig 4. https://doi.org/10.1371/journal.pone.0264326.g004

Discussion

We first discuss the students’ performance when they first used the generators. For their first lexical specification (lab 4), 4.5% more ANTLR students finished their work in the lab, their percentage of completion was 15.5% higher, they required 1.7 fewer hours of autonomous work, and they considered ANTLR to be simpler, more intuitive, and more maintainable than Lex. These differences are even higher after lab 5, when they build a parser. These data show significant differences at the students' first contact with both lexer and parser generators.

We also consider the influence of both tools when they must be used to build a complete language parser returning an AST (lab 7) and a whole compiler (lab 15). For these two scenarios, the differences are also significant. The students of the ANTLR group who managed to implement their parser and compiler on time were, respectively, 8.8% and 18.3% more than in the Lex/Yacc group. In their opinion, the benefits of ANTLR are greater than those reported by the students using Lex/Yacc, and the differences are significant. This indicates that the differences detected in their first contact with the tools are maintained when they have more experience using the generators.

The use of ANTLR has also implied an increase in students’ performance. For the pass rates, lab exams, and students’ final marks, ANTLR shows statistically significant differences, not only with respect to the Lex/Yacc group but also with respect to the values of the previous academic years. This leads us to think that ANTLR eases the implementation of the programming language. It might also help to understand the theoretical concepts, but there were no significant differences in the theory marks (the higher final marks of the ANTLR group were caused by the increase in their lab marks).

The number of students taking the exam also increased in the ANTLR group, and it is significantly higher than in previous years. This is important, because the Programming Language Design and Implementation course has quite low rates of students taking the exams. The average value for past years is 66.5%, while the ANTLR group reaches 89.9%. This fact derives from the greater number of students finishing the language implementation (the lab exam consists of extending their language). However, it could also be interpreted as a higher confidence level to modify and extend their compiler.

Another qualitative difference detected by lecturers and students, not included in the quantitative evaluation, is the plugin support and the TestRig testing tool provided by ANTLR. With these tools, the students can interactively modify the grammar, see the parse tree for the recognized input, read the error message if the input is not correct, and see how all of that changes as they modify the input. This is done within the IDE, improving on the batch process of lexer/parser generation, compilation, execution, and tree visualization used in the Lex/Yacc approach.

Threats to validity

In our study, we selected two random groups of students to perform our empirical comparison. The only difference between the contents of the lectures and labs delivered to both groups was the lexer and parser generator. Two different instructors delivered the lectures, which might have influenced the differences in students’ performance. However, there were no significant differences in the marks of each group in the previous years with the same instructors. For labs, most instructors teach both Lex/Yacc and ANTLR groups (there are 10 different lab groups).

Regarding the differences with the previous years, one minor change is that in 2016 the weights of theory and lab marks were modified from 30%-70% to 50%-50%. In 2020, half of the course was delivered online because of the COVID-19 pandemic [ 41 ]. Nonetheless, as detailed in Table 8 and Fig 3 , these changes did not cause significant differences in the students’ performance. Moreover, our control group (Lex/Yacc) for the last academic year falls within the 95% confidence interval of the previous years, for all the measures ( Table 8 ).

The conducted experiment measures the implementation of a medium-complexity imperative language. It is hard to generalize the comparison of parser generation tools for any kind of language and data format. We think the differences measured would be maintained (or even increased) for more complex languages, because ANTLR grammars tend to be more maintainable than those written in Lex/Yacc. However, it is hard to extrapolate the results to parsing scenarios of tiny languages.

Likewise, the study was conducted with Software Engineering students with Java programming skills, all of them beginners with parser generation tools (students retaking the course were not considered). According to Zelkowitz and Wallace, this kind of validation method is a controlled experiment within a synthetic environment, where two technologies are compared in a laboratory setting [ 42 ]. Due to the validation method and the type of participants, the results should not be generalized to state that one tool is better than the other.

In an empirical research study where users utilize two different tools, the participants are aware of the tool they are using, so a blind experiment is hardly feasible [ 24 ]. Therefore, the students’ expectations about the use of ANTLR may have represented a psychological factor that influenced their opinion of that tool. Although the lecturers did not compare ANTLR with Lex/Yacc, the students may have known they were using different tools because they could have spoken to one another. However, the measurements that are not based on students’ opinions (completion percentages, work time, and evaluation data) seem to back up the answers that the students gave in the questionnaires.

Conclusions

The empirical comparison undertaken shows that, for the implementation of a programming language of medium complexity by third-year students of a Software Engineering degree, the ANTLR tool provides significant benefits compared to Lex/Yacc. The students show significantly higher performance when using ANTLR, both for the initial tasks of lexer and parser implementation and for modifying the lexical and syntax specifications of the whole compiler. The use of ANTLR increased pass rates, exam attendance percentages, lab marks, and final grades, compared to the previous academic years when Lex/Yacc was used. According to the students’ opinion, ANTLR provides higher simplicity, intuitiveness, and maintainability than Lex/Yacc.

We evaluated both tools for the development of a medium-complexity imperative language, using the Java programming language. This study could be extended to other kinds of programming languages, data formats, and implementation languages. The evaluation could also be enhanced with other parser generators [ 4 ] and different tools for compiler construction such as tree walkers [ 10 ], type checker generators [ 43 , 44 ], and language implementation frameworks [ 45 ].

  • 1. Aho AV, Lam MS, Sethi R, Ullman JD. Compilers: Principles, Techniques, and Tools (2nd Edition). Addison Wesley; 2006.
  • 2. Louden KC. Compiler Construction: Principles and Practice. USA: PWS Publishing Co.; 1997.
  • 4. Jarble. Comparison of parser generators; 2022. https://en.wikipedia.org/wiki/Comparison_of_parser_generators .
  • 5. Friedrich AC. CMT, C Maintenance Tool, research manual. Carlow Institute of Technology, Ireland; 2009. C00132716.
  • 6. Shanghai Consultancy. Global Ranking of Academic Subjects, Computer Science; 2022. https://www.shanghairanking.com/rankings/gras/2021/RS0210 .
  • 7. Quacquarelli Symonds. QS World University Rankings—Computer Science; 2022. https://content.qs.com/wur/Computer_Science.htm .
  • 8. Times Higher Education. World University Rankings 2021 by subject: Computer Science; 2022. https://www.timeshighereducation.com/world-university-rankings/2021/subject-ranking/computer-science#!/page/0/length/25/sort_by/rank/sort_order/asc/cols/stats .
  • 9. USNews. Best Global Universities for Computer Science; 2022. https://www.usnews.com/education/best-global-universities/computer-science?int=994b08 .
  • 10. Parr T. The Definitive ANTLR 4 Reference. 2nd ed. Raleigh, NC: Pragmatic Bookshelf; 2013.
  • 11. da Cruz D, Varanda MJ, Verón M, Fonseca R, Henriques PR. Comparing Generators for Language-based Tools. Departamento de Informatica, Universidade do Minho, CCTC, Braga, Portugal; 2007.
  • 12. Klint P, van der Storm T, Vinju J. On the Impact of DSL Tools on the Maintainability of Language Implementations. In: Proceedings of the Tenth Workshop on Language Descriptions, Tools and Applications. LDTA’10. New York, NY, USA: Association for Computing Machinery; 2010. p. 1–9.
  • 13. Fowler M. Language Workbenches: The Killer-App for Domain Specific Languages?; 2005. http://martinfowler.com/articles/languageWorkbench.html .
  • 14. Dmitriev S. Language Oriented Programming: The Next Programming Paradigm; 2004. https://resources.jetbrains.com/storage/products/mps/docs/Language_Oriented_Programming.pdf .
  • 16. Grosch J. Generators for high-speed front-ends. In: Hammer D, editor. Compiler Compilers and High Speed Compilation. Berlin, Heidelberg: Springer Berlin Heidelberg; 1989. p. 81–92.
  • 17. Chen X. Hyacc parser generator; 2022. http://hyacc.sourceforge.net .
  • 18. Chen X, Pager D. Full LR(1) Parser Generator Hyacc and Study on the Performance of LR(1) Algorithms. In: Proceedings of The Fourth International C* Conference on Computer Science and Software Engineering. C3S2E’11. New York, NY, USA: Association for Computing Machinery; 2011. p. 83–92.
  • 21. Parr T, Fisher K. LL(*): The Foundation of the ANTLR Parser Generator. In: Proceedings of the 32nd ACM SIGPLAN Conference on Programming Language Design and Implementation. PLDI’11. New York, NY, USA: Association for Computing Machinery; 2011. p. 425–436.
  • 22. Parr T, Harwell S, Fisher K. Adaptive LL(*) Parsing: The Power of Dynamic Analysis. In: Proceedings of the 2014 ACM International Conference on Object Oriented Programming Systems Languages & Applications. OOPSLA’14. New York, NY, USA: Association for Computing Machinery; 2014. p. 579–598.
  • 23. Levine JR, Mason T, Brown D. Lex & Yacc (2nd Ed.). USA: O’Reilly & Associates, Inc.; 1992.
  • 26. University of Oviedo. School of Computer Engineering, Bachelor of Software Engineering; 2022. https://ingenieriainformatica.uniovi.es/infoacademica/grado .
  • 28. Gamma E, Helm R, Johnson R, Vlissides JM. Design Patterns: Elements of Reusable Object-Oriented Software. 1st ed. Addison-Wesley Professional; 1994.
  • 29. Watt DA. Programming language processors—compilers and interpreters. Prentice Hall International Series in Computer Science. Prentice Hall; 1993.
  • 33. Krippendorff K. Content Analysis: An Introduction to Its Methodology (second edition). Sage Publications; 2004.
  • 37. Ortin F. An empirical evaluation of Lex/Yacc and ANTLR parser generation tools (support material website); 2022. https://reflection.uniovi.es/download/2022/plos-one .
  • 38. Cohen J. Statistical Power Analysis for the Behavioral Sciences. Lawrence Erlbaum Associates; 1988.
  • 40. Georges A, Buytaert D, Eeckhout L. Statistically Rigorous Java Performance Evaluation. In: Proceedings of the 22nd Annual ACM SIGPLAN Conference on Object-oriented Programming Systems and Applications. OOPSLA. New York, NY, USA: ACM; 2007. p. 57–76.
  • 43. Ortin F, Zapico D, Quiroga J, Garcia M. TyS—A Framework to Facilitate the Implementation of Object-Oriented Type Checkers. In: Proceedings of the 26th International Conference on Software Engineering and Knowledge Engineering. SEKE’14; 2014. p. 150–155.
  • 45. Wimmer C, Würthinger T. Truffle: a self-optimizing runtime system. In: Leavens GT, editor. Conference on Systems, Programming, and Applications: Software for Humanity, SPLASH’12, Tucson, AZ, USA, October 21-25, 2012. ACM; 2012. p. 13–14.

lex and yacc: Tools Worth Knowing

This article is about how Linux was used to program a Sun machine to manipulate well-log recordings in support of oil and gas exploration in Western Canada. It will describe the problem, provide enough background to make the problem understandable, and then describe how the two fascinating UNIX utilities lex and yacc were used to let a user describe exactly what he wanted to satisfy his particular need.

In the fall of 1993, I had been recently downsized and was approached by a former colleague for assistance. He, and another former colleague, both of whom were still with my last employer, were what is known in the industry as well-log analysts.

To understand what a log analyst is requires a little knowledge of oil and gas exploration methods. Energy companies, as they like to be known, employ several different categories of professionals to assist them in finding salable quantities of hydrocarbons. Chief among these are the geologists and geophysicists (of which I am one) who study the recordings made in bore holes, or process and examine seismic recordings to identify what are popularly known as “plays” or “anomalies”.

Bore holes are simply the holes left when a drill rig moves off the drilling platform. Generally, these holes are logged by service companies who lower instruments called tools into the hole, and who then record on magnetic tape the readings made by those instruments.

There are many different types of tools, including sonic (which measures the time needed for a pulse of sound energy to travel through the rock wall from one end of the tool to the other), density (a continuous recording of the rock wall density), and gamma ray (a measure of gamma radiation intensity in the rock). These are just a few of the types of measurements that are made, recorded and called logs .

The various logs are examined by geologists to gain an understanding of what was happening when the rocks forming the bore hole were laid down, and what has happened to them subsequently as shallower rocks were created above them.

Geophysicists are more inclined to study seismic recordings which in essence are indirect measurements of the properties of the rocks forming the subsurface. Geophysics and Linux will not be discussed here, but you may find Sid Hellman's “Efficient, User Friendly Seismology”, Linux Journal , August 1995, Issue 16 of interest.

While seismic recordings provide much greater volumes of interpretable data over large areas, well logs are definitive measurements made at single locations, sometimes close together, and sometimes not. Geologists often correlate adjacent well logs to create cross sections of the subsurface, much like seismic recordings would provide. Detailed interpretation of individual logs, however, is often left to the log specialists.

My two acquaintances were log specialists who wanted an additional tool to assist them in the processing and interpretation of individual or combinations of logs. Specifically, they wanted to tell the computer to perform arithmetic operations on individual or some algebraic combination of logs.

For example, they might need to scale a specific log by an arbitrary amount, say 1.73. In another case, they might want to divide one log by another, and then add the result to a third, all before adding a constant and then raising the resulting values to some arbitrary power.

Keeping in mind that logs are composed of individual sample values taken as frequently as every few inches (or centimeters as they are here in Canada and many other places in the world), these example requests would mean a reasonable amount of computation, multiplying every sample of thousands of meters of recorded log values by 1.73, in the first example. The simple scaling problem isn't particularly difficult, but identifying the desired log could be.

The energy company for which my acquaintances worked was using a simple filing method for all the log curves (a curve corresponds to all the recorded samples for one tool in one bore hole) wherein each curve was identified by a name. To this was added some additional information on units and so on, plus all the samples for all the curves for the well. All the data were stored as ASCII. (The file format is known as Log ASCII Standard format, or LAS version 2.0, and although the names for curves were generally the same from well to well, that was not guaranteed.)

As a result, more complicated combinations of curves required a fairly sophisticated and robust mechanism for arbitrary name recognition, while the desired algebraic operation was being described. Given such an interesting challenge, I recognized an opportunity to put some relatively little-used tools to work: lex and yacc .

The program lex is a lexical-analyzer generator. Lexical analysis is the recognition of words in a language. As used in this particular application, lex, or more specifically flex, is used to recognize the characters forming the names of log curves, arithmetic operators and algebraic groupings.

flex is a particular example of the lexical analysis programs available for UNIX systems and is the implementation delivered with Linux distributions. The original implementation was done by Mike Lesk and Eric Schmidt at Bell Laboratories and is documented in the book lex & yacc by John R. Levine, Tony Mason & Doug Brown, O'Reilly & Associates, Inc., 1992.

yacc is a parser generator: the parser it produces accepts word items and, given a list of rules describing how these items form larger entities, deduces which items or combinations of items satisfy a rule. This can also be thought of as grammatical analysis. Once a rule is satisfied, a programmer's code is applied to the result.

In the case of English, the words in a sentence can be assigned grammatical types such as noun, verb, adjective and so on. Particular combinations of words form more complex units and these in turn can be described as complete sentences.

For example, the sentence “The lazy dog lay in the sun,” is composed of an article “the”, a preposition “in”, adjective “lazy”, nouns “dog, sun” and a verb “lay”. Combinations of these grammatical items form more complex entities such as noun phrases “The lazy dog” and “in the sun”. The first noun phrase is the subject of the sentence, and the second, in combination with the verb, forms the predicate. Together they form a complete sentence.

It is possible to form parsers for the English language, although given English's many idiosyncrasies, yacc may prove to be inadequate for the task. It may also be that the yacc programmer may become exhausted in trying to describe all the nuances of the language.

yacc was originally developed to provide a mechanism to develop compilers, but it could just as easily be used to create interpreters. For example, BASIC is often an interpreted language and could easily be described by a yacc grammar. Once yacc understood a particular line of BASIC code, it could cause the execution of the equivalent instructions in the native language of the host computer.

Some Linux distributions provide a choice of yacc programs. One can install either (or both) Berkeley yacc or the GNU bison program. You'll probably find them in /usr/bin. They are quite similar; bison was originally derived from yacc, although there has been some divergence over the years.

The combination of lex, yacc and some programmer's C code provides a complete means to interpret and act upon a user's wishes. The lex program uses its regular expression interpretation capability to recognize strings of characters as words or tokens. (The term “words” is used loosely to describe any recognized string of characters.) Once a token is identified, it is passed to yacc where the various rules are applied until some combination of tokens form a recognizable structure to which yacc applies some pre-programmed C code.

As indicated, lex uses regular expressions to recognize strings of characters as items of interest. Regular expressions are composed of special characters which describe acceptable combinations of characters.

For example, regular expressions often use the character . (period) to indicate that any character except a newline (\n) is acceptable.

Similarly, the characters [ and ] (square brackets) are used to indicate acceptance of any of the characters enclosed within them or within the range of characters described between them. For example, the expression [abc] says to accept any of the characters a, b or c; the expression [a-c] says the same thing. A more complicated example might be [a-zA-Z0-9] which says to accept any alphanumeric character.

For a complete description of lex regular expression syntax, see lex & yacc by Levine, Mason and Brown (O'Reilly, 1992).

Once a regular expression matches the text stream being interpreted by lex, code created by the programmer is executed. When used with yacc, this code generally amounts to passing an indication of what was just recognized to yacc for further processing. The indication is a token that yacc knows about, and in fact, these are defined in the yacc portion of the analyzer/parser program so that they are common to both lex and yacc.

Also as indicated, yacc uses a grammar description to decode the meaning of the tokens that lex passes to it. As tokens are passed to yacc, it applies its rules until a single token, or some sequence of tokens, becomes a recognizable structure.

Before a programmer's C code is executed, though, yacc may require several structures or token-structure combinations to be recognized. For example, using our sample sentence, our rules might look like the following:
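
    /* a yacc-like sketch; the symbol names are illustrative, not the original listing */
    sentence    : subject predicate        { /* act on the complete sentence */ }
                ;
    subject     : NOUN
                | noun_phrase
                ;
    predicate   : VERB noun_phrase
                ;
    noun_phrase : ARTICLE ADJECTIVE NOUN       /* "the lazy dog" */
                | PREPOSITION ARTICLE NOUN     /* "in the sun" */
                ;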

The first rule says that a sentence is made up of two parts: a subject followed by a predicate. If that rule is satisfied, then the C code between the curly brackets will be executed. To satisfy the first rule, yacc has to recognize both a subject and a predicate. The subsequent rules help yacc to do just that.

For example, the second rule says that a subject is recognized when either a noun or a noun phrase is recognized. A noun is the smallest unit that yacc deals with, and in the yacc grammar, a noun is a token that yacc will want to have lex recognize. Thus, somewhere in the yacc program, a token will be defined (probably called NOUN) that lex and yacc will use to communicate the fact that a noun has been interpreted. How this is done we will see shortly.

Notice that a noun phrase is also used to create the predicate. If a verb is recognized and it is followed by a noun phrase, the predicate is identified. If only the noun phrase is identified, then the subject has been identified.

The example cited is not in yacc syntax, but is meant to provide understanding. Very detailed examples may be found in the references.

You may be wondering how the yacc parser actually works. yacc works as a finite-state machine, and it has a stack (think of this as a memory of what has been seen, as it tries to deduce what the incoming stream of tokens represents).

A finite-state machine records its current condition as each recognizable item is interpreted. For example, as a noun phrase is being interpreted, it moves from state 3 when it receives a preposition to state 4 when the adjective is interpreted and finally to state 5 when the noun is recognized. When the entire phrase has been recognized, it switches to another state, perhaps 37, to note that fact. Please do not attach any particular meaning to the numbers used in this example. They have been chosen arbitrarily to demonstrate how yacc progresses as it interprets the tokens received from lex. You should conclude only that to reach state 5 (noun phrase), yacc must progress through several preceding states, each of which might lead to another final state, depending on the grammar yacc is using.

In other words, given its current state, yacc requests from lex the next token (if it needs it) and places onto the stack its new state. In doing so, it may push the new state onto the stack (as when interpreting the noun phrase), or pop the old state off the stack, replacing it with a new state (as when the noun phrase is completely recognized). These actions are called “shift” and “reduce” and describe pushing and popping states to and from the stack.

When the sentence is finally recognized, yacc accepts it and returns to the calling program (the main program which invoked yacc and indirectly lex). For a complete description of how a yacc parser works, see Inside Xenix by Christopher Morgan, Howard W. Sams and Co., 1986. This reference describes yacc grammars and how yacc parses its input in exquisite detail.

Both tools are coded in a similar manner. There are three sections in each program: declarations, rules and user routines. Each is separated by a line containing only the two characters %% .

For yacc, the declarations section contains the tokens it can use while parsing the input. Each has a unique value greater than 256, and the set of tokens is introduced via %token at the beginning of the line. lex can use the declarations section to define aliases for items that it must recognize while looking for tokens to pass to yacc.

For example, lex needs to know about white space which, while not used in identifying tokens, must be accounted for in some way. Similarly, mathematical symbols such as + or = must be recognized. These are needed in the interpretation of the algebraic statement coded by the user.

Within the rules section, yacc holds its parsing intelligence. This is the section that contains the grammar rules referred to in the English sentence example earlier. In fact, the coding used earlier is typical of a yacc grammar: items to be recognized are separated from the means to recognize them by a colon (:), and alternative means of recognition are separated from each other via a pipe (|) symbol.

lex uses the rules section to contain the regular expressions that allow it to identify tokens to pass to yacc. These expressions may be the aliases from the declaration section, regular expressions, or some combination.

The last section contains C code which may be invoked as each of the tools processes its input.
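
As a skeleton, a yacc source file therefore looks roughly like this (the token names and the trivial rule are placeholders; a lex source file follows the same three-part layout):

    %{
    /* declarations: C includes, types, and prototypes copied into the generated parser */
    %}
    %token NAME EQUAL PLUS MINUS

    %%

    /* rules: grammar productions and the C actions attached to them */
    input : /* empty */
          | input NAME
          ;

    %%

    /* user routines: ordinary C code, such as helper functions called from the actions */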

One requirement is that the yacc tokens be known to the lex program. This is accomplished by including the following statement:
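
    %{
    /* the %{ ... %} brackets copy this line verbatim into the scanner that lex generates */
    #include "y.tab.h"
    %}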

in the lex declarations section and creating it when compiling the yacc program code.

Compilation is accomplished in the following way:
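
    # file names here are placeholders for the actual lex and yacc sources
    yacc -d parser.y          # produces y.tab.c and y.tab.h
    lex scanner.l             # produces lex.yy.c
    cc -c y.tab.c lex.yy.c    # compile both; the objects are then linked into the application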

The -d option on yacc's command line creates the y.tab.h file needed by lex.yy.c.

To successfully interpret the user's desired process, the program needs to know which well logs were available for processing. This information is available in the ASCII file selected by the user. This text file contains a one-to-one correspondence between curve description and data values. A very small subset of a typical file is shown in Listing 1.

As can be seen, there are several sections including well information (includes some hole parameters), curve information (notes which curves are in the file) and “A” which holds the recorded data values. Each is introduced with a tilde (~). Because the format of the file is fixed by convention, these are easily parsed, and needed values are stored for subsequent processing.

As written for the client, the program is a Motif application. The user selected the file to be processed; it was read in its entirety and numeric text items were converted to double-precision values.

Besides allowing file and curve merging and editing, there is a menu item for curve processing. Upon selecting this menu item, a dialog box is presented containing a list of available curves and arithmetic operations. The user selects curve names, numeric constants and operations which in turn are displayed as an algebraic operation on a text input line. When satisfied with the mathematical operation, the user clicks OK and the lex and yacc magic occurs. The result is stored as a separate curve and can be included in subsequent operations.

lex processed the incoming algebraic statement with the code shown in Listing 2.

Between lines 1 and 16 are declarations to be used in the program generated by lex. In particular, you will notice the inclusion of the header file y.tab.h which contains the following definitions:
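That header is simply a list of #define statements giving each token a numeric code, along the lines of the following sketch (the names and most values are illustrative; EQUAL is shown as 262 because that value is mentioned below):

#define NAME    257
#define NUMBER  258
#define PLUS    259
#define MINUS   260
#define TIMES   261
#define EQUAL   262
#define DIVIDE  263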

These definitions are used by both lex and yacc to describe the items yacc expects to receive from lex. They are generated by statements 73 to 77 of the yacc source which will be examined shortly.

Lines 17 to 31 of the lex listing contain declarations which amount to aliases for common items that we wish lex to recognize. For example, on line 21 we declare DIGIT to be any single digit from 0 to 9. Doing this allows us to declare INT (an integer) to be one or more DIGITs.

Lines 33 to 90 contain the rules by which lex interprets incoming text items. For example, on line 34 we recognize an equal sign (=) and return the token EQUAL to the caller. In y.tab.h, EQUAL is defined to be 262.

As you can see, the lex rules simply recognize text items and then advise the caller what was seen in the input stream.
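A sketch of what such definitions and rules can look like (the token names and the dval union member are illustrative, not the contents of Listing 2):

%{
#include "y.tab.h"     /* token codes such as EQUAL and NUMBER */
#include <stdlib.h>    /* atof */
%}
DIGIT    [0-9]
INT      {DIGIT}+
%%
"="        { return EQUAL; }
"+"        { return PLUS; }
{INT}      { yylval.dval = atof(yytext); return NUMBER; }
[ \t\n]+   ;           /* white space is consumed but not passed to yacc */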

yacc interprets the token stream passed to it by lex with the code in Listing 3, of which only a subset is shown. The code for the yacc routine (with the calling subroutine do_arithmetic and its accessory functions) was in excess of 900 lines. For those interested, it is available for your perusal from SSC's public FTP site. Listing 3 is a sample indicating what needed to be done.

Like the lex routine, yacc begins with lines to be included in the output code. Programs written for graphical user interfaces sit in a loop waiting for the user to click on something. When the user indicates what is needed, the GUI-based program calls a function to perform the required action. These "called functions" are popularly called callbacks. In this program, one of the callbacks was do_arithmetic, which in turn called the yacc routine, which in its turn called the lex routine.

In Listing 3, do_arithmetic is described in the first few lines, and a portion of the code may be seen in lines 428 to 532. They are shown only to give some indication of what was being accomplished.

yacc does the work with its rules section, beginning at line 79 and ending at line 426. Although it is too long to be included completely, you can see that, at line 80, an equation is defined to be something called an lhs (left-hand side), then EQUAL, then an rhs (right-hand side). Looking down the listing, you will see that an equation may also be described by an expr (expression). When either of these is encountered, yacc pops a key from an internal stack created by a function called push (see near line 557) and then causes a log curve to be returned to the caller by calling another function called get_curve (not shown here, but included with the yacc and lex code).
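Stripped of its actions, the rule described above has roughly the following shape (a sketch, not the actual Listing 3):

equation
    : lhs EQUAL rhs    { /* pop the key pushed earlier and return   */
                         /* the resulting log curve via get_curve() */ }
    | expr             { /* a bare expression is also accepted      */ }
    ;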

Between lines 118 and 139, you can see how some of the tokens yacc expects are processed when recognized. The complete listing has many more.

The lex, yacc and supporting code was successfully employed to allow the log analysts to process various log curves. To have written the C code to accomplish the lexical analysis and parsing logic would have taken much longer than the four weeks allowed. As it turned out, this code was much easier to create and debug than it was to introduce into the final Motif application, even though it was written as a callback.

In fact, the number of lines of lex (152) and yacc (953) code was far smaller than the number of lines generated by the two tools (2765). Of course, one could take the time to write much tighter code than these general-purpose tools deliver.

Nevertheless, should you be faced with a similar problem, I strongly recommend using lex and yacc. They are powerful, reliable tools worth knowing.

All listings referred to in this article are available by anonymous download in the file ftp://ftp.linuxjournal.com/pub/lj/listings/issue51/2227.tgz.


Example program for the lex and yacc programs

This section contains example programs for the lex and yacc commands.

Together, these example programs create a simple desk-calculator program that performs addition, subtraction, multiplication, and division operations. This calculator program also allows you to assign values to variables (each designated by a single lowercase letter) and then use the variables in calculations. The example programs are contained in two files: calc.lex (the lex specification) and calc.yacc (the yacc grammar).

The following descriptions assume that the calc.lex and calc.yacc example programs are located in your current directory.

Compiling the example program

To create the desk calculator example program, do the following:

  • Process the yacc grammar file using the -d optional flag (which informs the yacc command to create a file that defines the tokens used, in addition to the C language source code): yacc -d calc.yacc
  • Use the ls command to verify that the following files were created:
      y.tab.c: the C language source file that the yacc command created for the parser
      y.tab.h: a header file containing define statements for the tokens used by the parser
  • Process the lex specification file: lex calc.lex
  • Compile and link the two C language source files: cc y.tab.c lex.yy.c
  • Use the ls command to verify that the following files were created:
      y.tab.o: the object file for the y.tab.c source file
      lex.yy.o: the object file for the lex.yy.c source file
      a.out: the executable program file

Parser source code

The following example shows the contents of the calc.yacc file. This file has entries in all three sections of a yacc grammar file: declarations, rules, and programs.

  • Include standard I/O header file
  • Define global variables
  • Define the list rule as the place to start processing
  • Define the tokens used by the parser
  • Define the operators and their precedence
  • %start - Specifies that the whole input should match stat .
  • %union - By default, the values returned by actions and the lexical analyzer are integers. yacc can also support values of other types, including structures. In addition, yacc keeps track of the types, and inserts appropriate union member names so that the resulting parser will be strictly type checked. The yacc value stack is declared to be a union of the various types of values desired. The user declares the union, and associates union member names to each token and nonterminal symbol having a value. When the value is referenced through a $$ or $n construction, yacc will automatically insert the appropriate union name, so that no unwanted conversions will take place.
  • %type - Makes use of the members of the %union declaration and gives an individual type for the values associated with each part of the grammar.
  • %token - Lists the tokens which come from the lex tool, with their types (a sketch of such a declarations section is given after this list).
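A minimal sketch of what such a declarations section can look like (the dval and vblno members, the token names, and the expr nonterminal are illustrative, not the actual calc.yacc):

%{
#include <stdio.h>
double vbl[26];               /* one slot per single-letter variable */
%}

%start stat

%union {
    double dval;              /* numeric values               */
    int    vblno;             /* index of a variable, 0 to 25 */
}

%token <vblno> NAME
%token <dval>  NUMBER
%type  <dval>  expr

%left '+' '-'
%left '*' '/'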

Lexical analyzer source code
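A minimal lexical analyzer compatible with the declarations sketched above could look like this (an illustrative sketch, not the actual calc.lex):

%{
#include <stdlib.h>    /* atof */
#include "y.tab.h"     /* token codes and the yylval union */
%}
%%
[0-9]+(\.[0-9]+)?   { yylval.dval = atof(yytext); return NUMBER; }
[a-z]               { yylval.vblno = yytext[0] - 'a'; return NAME; }
[ \t]               ;    /* skip blanks and tabs */
\n|.                { return yytext[0];    /* operators and end of line */ }
%%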

Compiler Construction/Case Study 1B


Case Study 1B - C front-end (Lex and Yacc)

The purpose of this case study is to give an example of a compiler/interpreter front-end written in C using Lex and Yacc. An interpreter is used since it allows a working program to be created with minimal extra effort (after the construction of the front-end). This code could be developed into a compiler by replacing the last phase with a compiler back-end.

The code is shown in an order which underlines the processes of creating a compiler, so the same file will be shown multiple times as it is developed.

The case study develops an interpreter for Very Tiny Basic, which is specified in the Case Study 1 section of the book.

The following steps shall be taken to complete the interpreter:

  • Lexical Analysis
  • Syntax Analysis
  • Lexical Analysis with Semantics
  • Abstract Syntax Trees
  • Generating Abstract Syntax Trees
  • Interpreting

Many important features of a useful compiler/interpreter have been left out, for brevity and simplicity, including:

  • Dealing with Errors
  • Optimization

Also some processes do not apply to such a simple language as Very Tiny Basic, for example Name Tables do not apply since only single characters are used to name variables.

Requirements

  • A C compiler
  • Lex or equivalent ( flex ).
  • Yacc or equivalent ( bison ).
  • (preferred) The Make utility.

Lexical Analysis

The first draft of the lex file identifies different tokens by returning a certain value in the associated C action. For keywords and operators this is simple; however, identifiers, values and comments are trickier.

VTB.l - Version 1

This file can be processed using one of the following commands:
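These are, presumably, the standard invocations:

lex VTB.l     # traditional lex
flex VTB.l    # or its flex equivalent

Either way, the result is a C file named lex.yy.c.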

You may wonder where all those values we returned are coming from. They will be created by the Yacc grammar file when it is processed.

There are some differences from the Very Tiny Basic specification in Case Study 1, for instance MULT and DIV in place of MULDIV; this is because we need to know the difference between the two. LineNumber and WholeNumber are lexically identical, and so cannot be separated at this time. Defining the use and category of tokens is left until the next stage.

Syntax Analysis

VTB.y - Version 1

Lexical Analysis with Semantics

In this version of the lexer, the header file generated by yacc/bison is included. The header file defines the return values and the union that is used to store the semantic values of tokens; it was created according to the %token declarations and the %union part of the grammar file. The lexer now extracts values from some types of tokens and stores them in the yylval union.
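A sketch of the kind of rule this adds (the token names and union members are illustrative, not the actual VTB.l):

[0-9]+    { yylval.num = atoi(yytext); return WHOLENUMBER; }
[A-Z]     { yylval.var = yytext[0];    return VARIABLE;    }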

VTB.l - Version 2

Abstract Syntax Trees

Abstract syntax trees are an Intermediate Representation of the code that are created in memory using data structures. The grammatical structure of the language, which has already been defined and has been written down as a YACC grammar file, is translated into a tree structure. Using YACC & C, this means that most grammar rules and tokens become nodes. Comments are an example of what is not put in the tree.

The grouping of operands is clear within the structure, so tokens such as parentheses do not have to be present in the tree. The same applies to tokens which end blocks of code. This is possible because the rules in the grammar file can use these tokens to create the tree in the correct shape.

Illustration of grouping in abstract syntax trees

In this interpreter the Primary/Secondary expression structures could be discarded by collapsing them. This would add complexity to the code, so it is currently not implemented. Rem statements are also kept (without the comment text) since the definition of VTB implies that they are a valid target for Goto. In fact, a goto to a non-existent line is undefined, so this interpreter will issue an error.

absyn.h

Using pointers and dynamic allocation is a must for the tree structure, or else it would be infinite in size and thus use too many resources (think about it). In order to keep the code nice and neat, the node declarations are gathered in a single header file, absyn.h.
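A sketch of the kind of declarations such a header can contain (the node name and its fields are illustrative, not the actual absyn.h):

typedef struct Expr Expr;

struct Expr {
    int   op;       /* e.g. '+', '-', '*', '/', or 0 for a leaf */
    int   value;    /* used when the node is a number leaf      */
    char  var;      /* used when the node is a variable leaf    */
    Expr *left;     /* dynamically allocated subtrees           */
    Expr *right;
};

Expr *new_number(int value);
Expr *new_variable(char var);
Expr *new_binary(int op, Expr *left, Expr *right);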

absyn.c

The purpose of absyn.c is to provide initializing routines for all of the structures that are used in the abstract syntax trees. This code is not very interesting, and it is very repetitive.

Makefile

Since we are working with standard *nix tools and the normal build system used on *nix is make, it is useful to write a Makefile for the interpreter. Keep in mind that most compilers/interpreters are very large and need a more advanced build system than this example. They may require CVS, autoconf and many makefiles distributed across different directories. Since this one only uses five files, it is quite trivial.
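A minimal sketch of such a Makefile (the target name and flags are illustrative, and recipe lines must start with a tab character):

vtb: y.tab.o lex.yy.o absyn.o
	cc -o vtb y.tab.o lex.yy.o absyn.o

y.tab.c y.tab.h: VTB.y
	yacc -d VTB.y

lex.yy.c: VTB.l y.tab.h
	lex VTB.l

clean:
	rm -f vtb *.o y.tab.c y.tab.h lex.yy.c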

Generating Abstract Syntax Trees

VTB.y - Version 2

The lexer is now giving values for tokens and the abstract syntax tree structure has been written. Next, the grammar file is updated to construct the trees from the rules and semantic values. All the tree node types are added to the union declaration, and rules must be given types and return the correct type. A sketch of the kind of rule this produces is shown below.
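For illustration only (the nonterminal names, the expr union member, and new_binary follow the earlier sketches rather than the actual VTB.y):

%type <expr> expression term      /* declarations section; assumes a union member Expr *expr */

expression                        /* rules section */
    : expression PLUS term    { $$ = new_binary('+', $1, $3); }
    | expression MINUS term   { $$ = new_binary('-', $1, $3); }
    | term                    { $$ = $1; }
    ;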

Interpreting

LEX and YACC


  • Lex is officially known as a "Lexical Analyser".
  • Its main job is to break up an input stream into more usable elements; or, in other words, to identify the "interesting bits" in a text file.
  • For example, if you are writing a compiler for the C programming language, the symbols { } ( ); all have significance on their own.
  • The letter a usually appears as part of a keyword or variable name, and is not interesting on its own.
  • Instead, we are interested in the whole word. Spaces and newlines are completely uninteresting, and we want to ignore them completely, unless they appear within quotes "like this".
  • All of these things are handled by the Lexical Analyser.
  • A tool widely used to specify lexical analyzers for a variety of languages
  • We refer to the tool as the Lex compiler, and to its input specification as the Lex language.


Lex specifications:

A Lex program (the .l file) consists of three parts:

declarations

translation rules

auxiliary procedures

The declarations section includes declarations of variables, manifest constants (a manifest constant is an identifier that is declared to represent a constant, e.g. #define PIE 3.14), and regular definitions.

The translation rules of a Lex program are statements of the form:

p1 {action 1}

p2 {action 2}

p3 {action 3}

where each p is a regular expression and each action is a program fragment describing what action the lexical analyzer should take when a pattern p matches a lexeme. In Lex, the actions are written in C.

  • The third section holds whatever auxiliary procedures are needed by the actions. Alternatively these procedures can be compiled separately and loaded with the lexical analyzer.

How does this Lexical analyzer work?

  • The lexical analyzer created by Lex behaves in concert with a parser in the following manner.
  • When activated by the parser, the lexical analyzer begins reading its remaining input, one character at a time, until it has found the longest prefix of the input that is matched by one of the regular expressions p.
  • Then it executes the corresponding action. Typically the action will return control to the parser.
  • However, if it does not, then the lexical analyzer proceeds to find more lexemes, until an action causes control to return to the parser.
  • The repeated search for lexemes until an explicit return allows the lexical analyzer to process white space and comments conveniently.
  • The lexical analyzer returns a single quantity, the token, to the parser. To pass an attribute value with information about the lexeme, we can set the global variable yylval.
  • For example, suppose the lexical analyzer returns a single token for all the relational operators, in which case the parser won't be able to distinguish between "<=", ">=", "<", ">", "==", etc. We can set yylval appropriately to specify the nature of the operator, as sketched below.
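A sketch of such rules (RELOP and the codes LE, GE, EQ, LT and GT are illustrative names for integer constants defined elsewhere):

"<="    { yylval = LE; return RELOP; }
">="    { yylval = GE; return RELOP; }
"=="    { yylval = EQ; return RELOP; }
"<"     { yylval = LT; return RELOP; }
">"     { yylval = GT; return RELOP; }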

Note: to know the exact syntax and the various symbols that you can use to write the regular expressions, consult the flex manual page on Linux (man flex).

The two variables yytext and yyleng

Lex makes the lexeme available to the routines appearing in the third section through two variables, yytext and yyleng.

yytext is a variable that is a pointer to the first character of the lexeme.

yyleng is an integer telling how long the lexeme is.
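For example, an action can use both variables (the pattern below is illustrative):

[a-zA-Z_][a-zA-Z0-9_]*   { printf("identifier %s has %d characters\n", yytext, yyleng); }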

A lexeme may match more than one pattern. How is this problem resolved?

  • Take for example the lexeme if. It matches the patterns for both keyword if and identifier.
  • If the pattern for the keyword if precedes the pattern for identifiers in the rules of the lex program, the conflict is resolved in favour of the keyword.
  • In general this ambiguity-resolving strategy makes it easy to reserve keywords by listing them ahead of the pattern for identifiers.

Lex's strategy of selecting the longest prefix matched by a pattern makes it easy to resolve other conflicts, like the one between "<" and "<=".

In the lex program, a main () function is generally included as:
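A sketch of that main function, placed in the third (user routines) section of the .l file, where the generated scanner's includes are already in effect (the file name is illustrative):

int main(void)
{
    const char *filename = "input.txt";   /* the input file to be scanned */
    extern FILE *yyin;                    /* the stream lex reads from    */

    yyin = fopen(filename, "r");
    while (yylex() != 0)                  /* request tokens until end of input */
        ;
    fclose(yyin);
    return 0;
}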

Here, filename corresponds to the input file, and the yylex routine is called, which returns the tokens.

  • Yacc is officially known as a "parser".
  • Its job is to analyse the structure of the input stream and operate on the "big picture".
  • In the course of its normal work, the parser also verifies that the input is syntactically sound.
  • Consider again the example of a C compiler. In the C language, a word can be a function name or a variable, depending on whether it is followed by a ( or an =; and there should be exactly one } for each { in the program.
  • YACC stands for "Yet Another Compiler Compiler". This is because this kind of analysis of text files is normally associated with writing compilers.


How does yacc work?

  • yacc is designed for use with C code and generates a parser written in C.
  • The parser is configured for use in conjunction with a lex-generated scanner and relies on standard shared features (token types, yylval, etc.) and calls the function yylex as a scanner coroutine.
  • You provide a grammar specification file, which is traditionally named using a .y extension.
  • You invoke yacc on the .y file and it creates the y.tab.c file (and, with the -d option, the y.tab.h header) containing a thousand or so lines of intense C code that implements an efficient LALR(1) parser for your grammar, including the code for the actions you specified.
  • The generated file provides an extern function, yyparse, that will attempt to parse a valid sentence.
  • You compile that C file normally, link with the rest of your code, and you have a parser! By default, the parser reads from stdin and writes to stdout, just like a lex-generated scanner does.

Difference between LEX and YACC

  • Lex is used to split the text into a list of tokens, what text become token can be specified using regular expression in lex file.
  • Yacc is used to give some structure to those tokens. For example, in programming languages we have assignment statements like int a = 1 + 2;, and we want to make sure that the left-hand side of '=' is an identifier and the right-hand side is an expression (it could be more complex than this). This can be coded using a CFG rule, and this is what you specify in the yacc file; you cannot do it using lex (lex cannot handle recursive languages). A sketch of such a rule is given after this list.
  • A typical application of lex and yacc is for implementing programming languages.
  • Lex tokenizes the input, breaking it up into keywords, constants, punctuation, etc.
  • Yacc then implements the actual computer language; recognizing a for statement, for instance, or a function definition.
  • Lex and yacc are normally used together. This is how you usually construct an application using both:
  • Input Stream (characters) -> Lex (tokens) -> Yacc (Abstract Syntax Tree) -> Your Application
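A sketch of such a rule in yacc form (the token names TYPE, IDENTIFIER and NUMBER, and the expression grammar, are illustrative):

assignment
    : TYPE IDENTIFIER '=' expression ';'
    ;

expression
    : expression '+' term
    | term
    ;

term
    : IDENTIFIER
    | NUMBER
    ;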
What is LEX in Compiler Design?

In this article, we will understand what Lex is in compiler design, but before understanding Lex, we have to understand what lexical analysis is.

Lexical Analysis

It is the first step of compiler design: it takes a stream of characters as input and gives tokens as output, a process also known as tokenization. The tokens can be classified into identifiers, separators, keywords, operators, constants, and special characters.

It has three phases:

  • Tokenization: It takes the stream of characters and converts it into tokens.
  • Error Messages: It reports errors related to lexical analysis, such as an identifier exceeding the maximum length or an unmatched string.
  • Eliminating comments and white space: It removes comments, spaces, blank lines, new lines, and indentation.

Lex is a tool, or a computer program, that generates lexical analyzers (which convert the stream of characters into tokens). The Lex tool itself is a compiler: the Lex compiler takes a specification of input patterns and transforms it into a lexical analyzer. It is commonly used with YACC (Yet Another Compiler Compiler). It was written by Mike Lesk and Eric Schmidt.

Function of Lex

1. In the first step, the source code written in the Lex language (in a file named 'File.l') is given as input to the Lex compiler, commonly known as Lex, which produces lex.yy.c as output.

2. After that, lex.yy.c is given as input to the C compiler, which produces an executable file a.out; finally, a.out takes the stream of characters and generates tokens as output.


Block Diagram of Lex

Lex File Format

A Lex program consists of three parts, separated by %% delimiters:

Declarations: The declarations include declarations of variables.

Translation rules: These rules consist of a pattern and an action.

Auxiliary procedures: The auxiliary section holds the auxiliary functions used in the actions.

For example:
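A minimal complete Lex program in this three-part layout might look like the following sketch; it simply counts lines and words, purely for illustration:

%{
/* declarations: C code and variables used by the actions */
#include <stdio.h>
int lines = 0, words = 0;
%}

%%
 /* translation rules: a pattern followed by a C action */
\n            { lines++; }
[a-zA-Z]+     { words++; }
.             ;
%%

/* auxiliary procedures */
int main(void)
{
    yylex();
    printf("%d lines, %d words\n", lines, words);
    return 0;
}

It can typically be built with lex count.l followed by cc lex.yy.c -ll (or flex count.l and cc lex.yy.c -lfl), where count.l is whatever name the specification file is given.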
