Homework 8: Regular Expressions and Text Processing

Due Monday, April 2, by 10:30 AM


As usual, place the commands you used to solve each of the 5 problems into a file called homework8.
  1. Recall the word-finding demo from class. Using the dictionary /usr/share/dict/american-english, find the words that contain an r (upper or lower case), followed by any two letters, followed by a k (upper or lower case). Allow at most one additional letter on each end of the pattern. Some words that match are wrack, trick, rank, ranks, and crocks. Words with more than one extra letter on either end, such as rocket or windbreak, should not match.
  2. Use grep and wc to figure out how many words in the dictionary are exactly 10 letters long. Exclude words with an apostrophe (single quote).
  3. Two common three-letter sequences to find at the beginning of English words are non (such as nonsense, none, or nonstop) and pre (such as predict or preface or predestination). Use grep and wc to figure out which sequence is more common. This will be easiest with two commands, one for non and one for pre.
  4. The Apache web server log in /var/log/apache2 contains a great deal of text, which can be difficult to read. Notice that requests for web pages contain GET. Use grep and cut to display ONLY the web pages requested, not any of the other information in the file. Filter out other types of requests, such as POST. I suggest starting with grep to select only lines containing GET, and then using cut with a space as the delimiter, but the method is up to you.
  5. isoptera hosts webpages for multiple classes at LCSC: cs101, cs228 (which you are taking), cs430, and cs440. Using the result of the previous problem, filter the output further to keep only lines which contain a course name. Then use sed to replace the course numbers with names: cs101 with CSSeminar, cs228 with Linux, cs430 with OperatingSystems, and cs440 with ArtificialIntelligence. So an entry such as "/cs430/p1.html" becomes "/OperatingSystems/p1.html", and "/cs228/syllabus.pdf" becomes "/Linux/syllabus.pdf". If you feel the pipe has become too long to manage, you can save an intermediate result (such as the result of question 4) in a file and work from there. Remember, > can be used to redirect output into a file.
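The sketch below is not a solution to any of the problems. It only illustrates the tools named above (grep with anchors, wc, cut, sed, and > redirection) on made-up data. The file names (words, log, paths), the toy log layout, and the replacement names (alpha, beta) are all hypothetical; the real Apache log has a different layout, so the cut field number will likely differ.

```shell
# Scratch directory with made-up sample files (all names are hypothetical).
tmpdir=$(mktemp -d)
printf '%s\n' cat cot coat "cat's" scat cats > "$tmpdir/words"
printf '%s\n' \
  'host1 GET /a/index.html 200' \
  'host2 POST /a/form 200' \
  'host3 GET /b/notes.pdf 200' > "$tmpdir/log"

# ^ and $ anchor the pattern to the whole line, so no extra letters can
# sneak in on either side. With -E, "?" makes the preceding item optional.
# Pattern: c, any one letter, t, then at most one more letter.
matches=$(grep -E '^c[a-z]t[a-z]?$' "$tmpdir/words")
echo "$matches"

# Count matching lines by piping grep into wc -l.
count=$(grep -E '^c' "$tmpdir/words" | wc -l)
echo "$count"

# Keep only GET lines, take the third space-separated field,
# and save the intermediate result in a file with >.
grep 'GET' "$tmpdir/log" | cut -d ' ' -f 3 > "$tmpdir/paths"

# Several substitutions can be chained with -e. Using | as the sed
# delimiter avoids escaping the slashes in the paths.
renamed=$(sed -e 's|/a/|/alpha/|' -e 's|/b/|/beta/|' "$tmpdir/paths")
echo "$renamed"
```

Saving the cut output in a file first keeps each later command short, which is the same idea as saving the result of problem 4 before starting problem 5.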
Turn in your homework by creating a file named "homework8" with no extension in your home directory, with permissions set so that only you can view the file. This assignment is worth 35 points: 5 for correct naming, 5 for correct permissions, and 5 for each of the problems above. As a reminder, you can set the permissions correctly with this command:
chmod 600 homework8
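
One way to confirm the permissions took effect is to look at the mode string printed by ls -l. The sketch below does this on a throwaway copy in a scratch directory (the scratch path is hypothetical) so that your real file is not touched; on your own file you would simply run ls -l homework8 from your home directory.

```shell
# Work in a scratch directory so nothing real is modified.
scratch=$(mktemp -d)
touch "$scratch/homework8"
chmod 600 "$scratch/homework8"

# The first 10 characters of ls -l are the file type and permission bits.
# Mode 600 means read and write for the owner, nothing for group or others.
perms=$(ls -l "$scratch/homework8" | cut -c 1-10)
echo "$perms"   # -rw-------
```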