Iteration 1

| No Comments

Hey All!

If you are enrolled in CSCI 3081W at the University of Minnesota most of this will sound familiar and if not I hope this gives you some insight into the blunders and successes of implementation techniques used in computer programing. For those that are not familiar with the project going on in the class (3081W) we are building a translator that takes pseudo code created by our professor and translates it into C/C++. Sounds cool right? Well it will when it's done, but for now we are at the beginning stages of the translator project.

For several weeks on every Thursday night in class we have discussed implementation strategies and how to properly use subversion control. These two nuggets of knowledge are in some senses broad but once you understand the main concepts of both they become very useful in the construction of any computer science project. The first facet - implementation - is paramount to the success of any project computer science related or not. We talked about iterative and waterfall models and saw the benefits of both but moreover the cons that each presented. For many projects outside the realm of computer science we see the waterfall method used which is composed of completing each "stage" - Conception, Initiation, Analysis, Design Construction, Testing, Production and Maintenance - before moving onto the next "stage." For many clearly defined endeavors this is a great process for completing the task at hand. However, for computer science issues the iterative method proves to be more successful. The iterative method involves all the same "stages" as the waterfall method, yet each stage does not have to be completely done before moving onto the next. This brings us to the title of this blog entry: Iteration 1.

In this iteration of the project we begin by building a scanner that will take the pseudo code read it into the scanner and output a list of tokens that represent the characters in the pseudo code file. If you are not familiar with the project you are probably a little confused, but as I begin to explain our implementation hopefully the basics of iteration 1 will become clearly visible.

We began with a template of a scanner and translator that we worked on during labs 1, 2 and 3, and then we were given sample files that contained the pseudo code files. During these labs we learned about the scanner itself and regular expressions, which I will refer to as regexes for the remainder of this blog. Regexes contain two things: strings called lexemes that represent the character we wish to match and a specific tokens to which the lexeme is set to. For example to represent a left curly brace the lexeme would be "^{" and the token could be leftCurlyRegex.

Having two crucial parts of iteration 1 identified Cord and I decided to split the work accordingly. He would work on the scanner and I would work on the regexes. Now this doesn't mean we went our separate ways after this and did the work assigned. Instead, we worked side by side through 95% of our entire creation of iteration 1. Cord would be on this Mac and I would be beside him on my laptop using Linux to "ssh" into the lab machines. This was a very critical part of how we as a group decided to work on the project. By sitting side by side we could discuss how far we were and if we hit a hiccup we immediately had another set of eyes to look at the code. The latter part of that last sentence is crucial because I have found that many times when one person sits and writes code they begin to get tunnel vision and can't seem to see their errors or find solutions to what may be a simple problem. We figured that although being together for the entire iteration's development would be inconvenient on some levels, but it would save us time and hassle in the long run.

The other benefit of being together was the ease of using subversion. We ran into very little merge errors or conflicts with subversion, because whenever one person would commit or begin editing a new file we would discuss it with each other. Especially during committals we would make sure the other person would instantly run an update in order to have up-to-date files. Also, whenever we did encounter a merge conflict we both went to one of our laptops and went through together to fix the conflict.

Now onto the actual implementation. I will talk in depth about the regex development and testing process and then bring it full circle with how we interlinked the scanner with the regexes. In the beginning stages of the regex development I took and brainstormed how I was going to create and test all 50+ regexes. My first thought was to write each regex down in the order they were given to us (beginning with the key words and ending with the operators). I then decided it would be best to look at each of the sample files and create the regexes as I encountered them in the sample files. This worked fairly well as I would write the regex on paper, code it into the scanner.cpp file, and then write a test for it in regex_tests.h. I had to look online to obtain a specific list of regex characters, what they did, and how to use them properly. This helped immensely except for one hiccup I encountered. This slight problem occurred when I decided to used dollar signs ($) at the end of each regexes lexeme to signify the end of the string that the regex would match. I wrote simple tests that were independent of the scanner in regex_tests and they all passed with the dollar sign included in the lexeme. Here is my first mistake. I should have wrote more complex tests to see if the regex would match and I didn't. With that said after all the regexes were built and my partner completed the scanner objects we began to run tests dependent on the scanner and regexes working together and none passed. It was pretty obvious where the error was and it was in the regex definitions.

We both took a look at the regexes I had built and brainstormed what could be the issue. After little debate we settled on the fact that having the dollar sign at the end of the lexemes is the most likely culprit to our errors. Sure enough we went online and did some more research on exactly what the dollar sign does and it signifies the end of the line. This means that after the regex is matched it say that there is nothing more to look at on that line and hence does not allow the scanner to identify and characters passed those recognized by the first regex on that line. It took a little time but I went back and fixed these simple errors by removing the dollar sign and all but one regex test passed.

This one error we encountered would be one of the last for iteration one, but also one of the longest to solve. This second error was occurring in the string constant regex, which as the name implies matches a string. We couldn't produce the correct regex that would match any string. Our error was coming in that we could match the first quotation mark but we couldn't match the second so that it would end the string. After looking online at regex help sites and many tests we finally came up with a solution that worked.

My partner ran our Makefile on his Mac and it said we passed all scanner_tests and regex_tests! He committed and I updated and ran the same makefile and to our surprise had a segmentation fault occur in scanner_tests. We were baffled. At first we decided that I should dump my group repository and checkout a fresh one, because there could be the possibility of a merging error that was causing the segmentation fault. I did this to no avail - the segmentation fault remained. We began to look through the scanner files and couldn't find anything blatantly apparent that was wrong. What made matters worse is that it would run and all tests would pass on my partners Mac. We finally came to the conclusion that we need to use a debugger, in this case GDB, to determine exactly where this segmentation fault was occupying. The debugger was a huge help and would of saved us hours of aimlessly searching through our code if we would have used it right away. It told us exactly what line the error was occurring and it happened to be once again in scanner.cpp in our regex definitions. After looking at the regexes for a while it wasn't apparently obvious why they were creating the fault. They seemed right and they passed the individual regex_tests that I had tried earlier. Then all of a sudden when looking at the not equals regex it hit me. We had not escaped some characters that needed to be escaped because they are metacharacters: characters that regexes uses to distinguish specific things within the lexemes. Sure enough we put double backspaces in several of the lexemes and ran the tests and the segmentation fault disappeared! All the tests passed and all that was left to do was add onto our makefile the iteration 1 assessment tests and the eve of destruction tests that we ran opted not to do due to time constraints.

I edited the makefile while my partner went in and cleaned up the scanner code he had written adding comments and such. After completion of both of these tasks we ran the makefile again and all tests successfully passed. We were happy to say the least.

Now let me back up a little bit to explain how Cord and I decided to implement the scanner methods (which my partner wrote the majority of). The idea was to use an array that would hold all the regexes and then pass the characters withing the files our scanner encountered through the entire array. We then had to develop a way to have certain regexes that would both match the same character(s) behave correctly. This idea stemmed heavily from precedence. We thought of two ways to do this. First, we could put each regex into the array in a way that it would encounter the correct regex first before it would reach a regex that would match but was not the correct regex. Second, we determined that we could accomplish correct precedence if we used a counter that kept track of the matched characters and always chose the greater of two matched character sequences. We chose to use the latter implementation of regex precedence.

We developed a matched characters counter which kept track of the matched characters and compared that to the matched character counter of ever other regex the characters had had a match with. We then chose the greatest matched character regex and placed it into an output array.

Next, we had to develop a way in which the output array would hold all the regexes from the file we scanned in a way that was linked and in some order. We decided to use an array and a pointer object called "next." The way the array worked is every regex match was pushed into the front of the array and any regex in the array before was moved down one position. After the end of file regex was placed into the array a recursive call would set the next pointer on every element to the one preceding it creating a linked array of regexes.

With that said my partner knows more about the subject than I do, just how I know slightly more than he does about regexes. I wish this wasn't the case and it is both our goals to work more jointly on each aspect of future iterations. If we can manage to work on the same files jointly we will both maintain a greater understanding of the overall project which will surely benefit us down the line in this class and future classes.

On that same note, we as a group need to improve our testing process. We were to quick to write simple tests that tested only the basic premise of the code we wrote and had errors as a result. For future iterations we will need to have rigorous tests, as well as simple tests to ensure snippets of our code works properly before moving on. With that said, we should have tested more frequently throughout the entire first iteration. If we would have tested our scanner along with some regexes early on we may have realized the dollar sign issue, the segmentation fault error, and escaped character problem.

Overall, I think we managed our time well and working side-by-side is something I highly suggest. We also used the Piazza forums regularly to help answer some of our questions and many times we would look on the forum and a question we had was already answered. Piazza should be used by everyone and I hope it's used more on future iteration because we definitely appreciated those who posted replies and questions. Also, correct use of version control is paramount. Without it, it would be difficult to collaborate with your partner, and at the same time using it improperly can make things even worse! Berry's "Pain Paper" is absolutely right when it comes to learning how to do some of the initially "painful" facets of computer science well (ie: subversion) because if you don't it could cause for major headaches later on. Lastly, always always check that your code compiles on a lab machine. It I hadn't ssh'ed into a lab machine and compiled our code and ran the tests we may have scored close to a zero on the machine portion of iteration 1. With that I am signing off for today. If you have any comments post them below or feel free to email me at giaim003@umn.edu. Thanks!

- Tommy Giaimo

Leave a comment

About this Entry

This page contains a single entry by giaim003 published on March 5, 2012 5:19 PM.

Find recent content on the main index or look in the archives to find all content.