Another example of how I am applying Simplify, Standardize, Automate to my own stuff.
A lot of you know that, outside of my day job, I work on the FreeDOS Project. For a long time, I've had a kind of "blog" there where I'd write about various happenings, things that didn't really fit as a "news item" on the main page. This was a "blog" in the lowest sense - to add a new item to the page, I had to edit the page.
Just like how I used to do this blog, I created a simple shell script that automatically generated my FreeDOS "blog" pages based on posts I kept as html files in a particular directory. Recently, I decided the shell script was a good start, but what I really needed was a blogging system. Fortunately, SourceForge (where www.freedos.org is hosted) now offers application hosting for SF users. One of those applications is Wordpress, a popular blogging system.
Wordpress has a neat feature: categories. Using categories means I can use one blog to write about several different topic areas: FreeDOS updates, personal stuff, my work with Linux, etc.
But here's my problem: pretty much everyone on the FreeDOS Project would be uninterested in my personal stuff, and most people who follow my Linux work won't want to hear about FreeDOS. Similarly, my friends and family aren't really DOS geeks, so they won't want to see the FreeDOS stuff either. That means no one will likely read my blog as a whole, just the bits they care about. Sure, people could simply follow my Wordpress blog for just the category they want to see - but as I mentioned earlier, I cannot change the (ugly) web design.
How to solve this? In the past, I've kept a separate web page for FreeDOS stuff, another for "friends and family", and another for Linux stuff. But now, I wanted to use Wordpress for blogging, rather than update web pages manually.
Scripting to the rescue! I wrote a shell script that scrapes my Wordpress blog for a particular category, and displays it on each of my other web sites. Without posting the whole script (let me know if you want it), here's what happens:
- the blog category is fetched as an html file, using "wget"
- an in-line AWK script parses the html, to extract just the blog entries for that category
- another AWK script extracts the link(s) to the blog archives for that category
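The steps above can be wired together roughly like this. It is a sketch only: the function names are made up for illustration, and keeping the AWK program in a separate file (rather than in-line) is an assumption made to keep the example short. The category URL layout (.../category/NAME/) is taken from the links in the sample HTML further down; the base URL is whatever your blog uses.

```shell
#!/bin/sh
# Hypothetical skeleton of a getblog.sh-style script (names are illustrative).

# Build the category page URL from a base URL and a category name.
# A single trailing slash on the base URL is tolerated.
category_url() {
  printf '%s/category/%s/\n' "${1%/}" "$2"
}

# Fetch the category page with wget and pass it through an AWK program.
#   $1 = blog base URL, $2 = category, $3 = path to an AWK program file
getblog() {
  base=$1
  category=$2
  filter=$3
  # -q: no progress output, -O -: write the page to stdout
  wget -q -O - "$(category_url "$base" "$category")" | awk -f "$filter"
}
```

Calling getblog once with the AWK program that extracts the entries and once with the one that extracts the archive links would cover the second and third steps.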
The tricky bit is that AWK script. Here it is for the second step. Behold the power of AWK!
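# Start of a post: switch on "inside the blog" mode
/<div class="post hentry category-/ { blog=1 }
# A line that opens a div adds one level (counted once per line)
/<div/ { if (blog==1) {div++} }
# Print every line while inside the post
{ if ( (blog==1) && (div>0) ) {print} }
# A line that closes a div removes one level; at zero, the post is done
/<\/div/ { if (blog==1) { if (--div == 0) {blog=0} } }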
I expect every person who works on UNIX systems to understand what this means, as well as predict the output when run against this:
... </div><div class="post hentry category-uncategorized">
<h3 id="post-71"><a href="http:// ... /wordpress/jhall1/2009/02/05/test/" rel="bookmark" title="Permanent Link to test">test</a></h3>
<small>Thursday, February 5th, 2009</small><div class="entry">
<p>This is a test post to see if non-SF users can leave comments.</p><p><em>Update: looks like you need to have a SF account to leave comments.</em></p>
</div><p class="postmetadata"> Posted in <a href="http:// ... /wordpress/jhall1/category/uncategorized/" title="View all posts in Uncategorized" rel="category tag">Uncategorized</a> | <a href="http:// ... /wordpress/jhall1/2009/02/05/test/#respond" title="Comment on test">No Comments »</a></p>
</div>
<div class="navigation">
...
With this script in place, my personal blogs will magically appear at my "friends and family" web site, and my Linux comments will go to my "Linux" web site. And I just have to update a single blog.
This is the kind of scripting and level of automation that we need to use in Operations & Infrastructure, especially as we work to automate everything that we do.

How it works: In the html, the blog entries start with a div called "post hentry category-{category}". So for category "freedos", the div is called "post hentry category-freedos". Once I find that, I know I'm looking at the blog.
There are several divs inside the "post hentry category-{category}" div that contain the details of my blog posts. The /<div/ and /<\/div/ lines are essentially a "stack", so that every time I find a "<div>" start, I increment the stack, and every time I find a "</div>" I take one away from the stack.
That last line also says that when the stack is empty, I know I've exited the blog.
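One thing to keep in mind: the counter in that script moves at most once per line in each direction, so a line that holds two div tags is only counted once. Below is a minimal sketch of a variant that counts every tag on the line, assuming the same "post hentry category-" marker. The function name extract_posts and the variable names are hypothetical, and this is not the script running on the site.

```shell
#!/bin/sh
# extract_posts: read HTML on stdin, print only the post divs.
# Hypothetical helper for illustration, not the script that runs on the site.
extract_posts() {
  awk '
  {
    rest = $0    # part of the line still to scan
    out = ""     # part of the line that belongs to a post
    while (rest != "") {
      if (!inpost) {
        # Skip ahead to the opening tag of the next post, if any
        i = index(rest, "<div class=\"post hentry category-")
        if (i == 0) break
        rest = substr(rest, i)
        inpost = 1
        depth = 0
      }
      # Find the next opening or closing div tag on this line
      if (match(rest, /<\/?div/) == 0) {
        out = out rest
        break
      }
      tag = substr(rest, RSTART, RLENGTH)
      out = out substr(rest, 1, RSTART + RLENGTH - 1)
      rest = substr(rest, RSTART + RLENGTH)
      if (tag == "<div") {
        depth++
      } else if (--depth == 0) {
        # Back at depth zero: keep the closing ">" and leave the post
        j = index(rest, ">")
        if (j == 0) j = length(rest)
        out = out substr(rest, 1, j)
        rest = substr(rest, j + 1)
        inpost = 0
      }
    }
    # Blank lines inside a post are dropped
    if (out != "") print out
  }'
}
```

It reads the page on standard input, so it can sit at the end of the same wget pipeline: wget -q -O - "$url" | extract_posts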
My web page (PHP) runs getblog.sh once an hour, and displays the output.
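How the once-an-hour part is done isn't shown here. One common way to get that behaviour is to cache the script's output in a file and regenerate it only when the file is more than an hour old. A minimal sketch, assuming a find that supports -mmin (GNU and BSD find do, but it is not in POSIX); the function name cached_run and the cache path are hypothetical:

```shell
#!/bin/sh
# cached_run CACHE_FILE COMMAND [ARGS...]
# Print the cached output of COMMAND, refreshing it when it is stale.
# Hypothetical helper; not necessarily how the web page does it.
cached_run() {
  cache=$1
  shift
  # Refresh when the cache is missing, empty, or older than 60 minutes.
  # Note: -mmin is a GNU/BSD extension to find.
  if [ ! -s "$cache" ] || [ -n "$(find "$cache" -mmin +60 2>/dev/null)" ]; then
    # Write to a temporary file first so a failed run keeps the old copy
    if "$@" > "$cache.tmp" && [ -s "$cache.tmp" ]; then
      mv "$cache.tmp" "$cache"
    else
      rm -f "$cache.tmp"
    fi
  fi
  if [ -f "$cache" ]; then
    cat "$cache"
  fi
}
```

For example, cached_run /tmp/freedos-blog.html ./getblog.sh would run getblog.sh at most about once an hour and serve the saved copy the rest of the time.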