Resurrecting Drezn Content, part II

I didn’t do much more work on pulling content out of the archived, two-decade-old files beyond what I did the other night and when I wrote yesterday’s article. By the time I wrote about it, I’d exhausted pretty much everything reasonable I could think of to try.

Before completely giving up and manually picking apart the files with strings (a Unix tool that streams through a file and prints any text it finds amongst the binary gibberish) or a hex editor, I decided to post a question about it on Server Fault, a site where systems administrators help each other figure out tough problems. Oddly, there are few questions there in this space.

While I was describing the problem and what I’d done to try to solve it, I reproduced the Python result I’d gotten, which was on the 64-bit VM. One of the last things I’d noticed the night I was trying all this out was that the PHP script failed with a different error message on the 32-bit VM. I hadn’t jumped on this at the time because I was tired, and I’d rebuilt the various pieces of software on both platforms multiple times in different versions and combinations, which could easily have led to inconsistent error messages one way or another. So I basically wanted to confirm that I got the same error with the default Python installed with Warty on the 32-bit VM.

I didn’t. Instead of an error I got a database handle.

Python 2.3.4 (#2, Aug 13 2004, 00:36:58)
[GCC 3.3.4 (Debian 1:3.3.4-5ubuntu2)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import gdbm
>>> db = gdbm.open('drezn.net/db/article.db', 'r')
>>>

Lol whut!

So it turns out I was right to be suspicious of 32-bit vs. 64-bit, as well as, probably, different versions of GDBM, though I haven’t confirmed that. (I came across some anecdotal evidence that GDBM is not portable across platforms and architectures.) Yay.

Now I have some fun work to do: figure out how best to manage the data I pull out. The simplest approach is probably to dump everything out as XML. That should make it easy to keep the data separated and structured without having to worry much yet about what I actually want to do with it. It’s mostly just the blog articles, but there was also a commenting feature and user accounts.

People hate on XML, but I think it’s useful and has its place, and I can’t think of a better format for this use case. Frankly, though, I should have archived this crap that way in the first place. I knew better. (There’s a small possibility I did, and that as soon as I finish this conversion, I’ll come across a directory with three XML files in it.)

Once I have XML it’s simple enough to work from there. I could brush off my XSLT knowledge and use that to convert it to Markdown, for example, and integrate it with the rest of the resurrected site.
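
If I do go the XSLT route, driving the transform from Python with lxml would be about this much code. This is just a sketch; articles.xml and to-markdown.xsl are placeholder names for files that don’t exist yet.

#!/usr/bin/python
# Sketch: apply an XSLT stylesheet to the dumped XML using lxml.
# Both file names below are placeholders.

from lxml import etree

doc = etree.parse('articles.xml')
transform = etree.XSLT(etree.parse('to-markdown.xsl'))
print(str(transform(doc)))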

First, though, let’s have a look at what’s in here.

I remembered that I’d used predictably named keys to be able to pull the desired article components, and printing out the keys tells me at a glance what’s in there.

Here’s the Python script:

#!/usr/bin/python

import gdbm
import sys

# Take the database file to dump as the only argument.
try:
  file = sys.argv[1]
except IndexError:
  print("Specify file")
  sys.exit(1)

# Walk the keys in whatever order GDBM stores them;
# nextkey() returns None once we've seen them all.
db = gdbm.open(file, 'r')
k = db.firstkey()
while k is not None:
  print(k)
  k = db.nextkey(k)
db.close()

The keys are not in order, so I sort the output and have a look. It’s very simple and I see everything I need in about 10 seconds. Maybe 20.

$ ./dump-drezn.py drezn.net/db/article.db | sort | less
...
article_9_Body
article_9_Category
article_9_Date
article_9_Synopsis
article_9_Title
__latest

Everything is article_<idx>_<field>. The Synopsis field is optional. The Date field appears to be a standard Unix timestamp. __latest contains the index of the newest article. There is nothing sophisticated here.
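
Pulling the articles back out is then just a loop over that schema. Here’s a rough sketch, assuming indexes run from 1 up to __latest with no gaps, which I haven’t actually verified:

#!/usr/bin/python
# Sketch: read each article via the key schema above.
# Assumes indexes 1..__latest with no gaps (unverified).

import gdbm

db = gdbm.open('drezn.net/db/article.db', 'r')
latest = int(db['__latest'])

for idx in range(1, latest + 1):
  article = {}
  for field in ('Title', 'Date', 'Category', 'Synopsis', 'Body'):
    key = 'article_%d_%s' % (idx, field)
    if db.has_key(key):  # Synopsis is optional
      article[field] = db[key]
  print('%d: %s' % (idx, article.get('Title', '(untitled)')))

db.close()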

After a few iterations I wind up with a serviceable XML output. It’s somewhat gross around the edge cases, as these sorts of one-offs so often are. But it works, and it’s enough to produce the XML file for the articles.
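
The guts of it are the earlier read loop with XML output bolted on, roughly like this sketch (the element names are illustrative, not necessarily what I actually used, and the real script handles more edge cases):

#!/usr/bin/python
# Sketch of the XML dump: escape each field and wrap it in an element.
# Element names are illustrative; the real script differs in details.

import gdbm
from xml.sax.saxutils import escape

db = gdbm.open('drezn.net/db/article.db', 'r')
latest = int(db['__latest'])

print('<?xml version="1.0" encoding="UTF-8"?>')
print('<articles>')
for idx in range(1, latest + 1):
  print('  <article id="%d">' % idx)
  for field in ('Title', 'Date', 'Category', 'Synopsis', 'Body'):
    key = 'article_%d_%s' % (idx, field)
    if db.has_key(key):
      tag = field.lower()
      print('    <%s>%s</%s>' % (tag, escape(db[key]), tag))
  print('  </article>')
print('</articles>')
db.close()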

Next I have to figure out what to do with the other two database files: comments and users. The users file I’ll probably ignore; the comments file, we’ll see. Re-publishing friends’ comments from back then seems iffy at best, so I’ll probably keep them to myself.

That’s leaving aside the bigger question of curation and editing. That can wait for another day.

I archived all my Facebook posts, and even Twitter for the brief time I tried that nonsense out. I’ve thought about pulling that in as well, which always leads to an internal dialogue about privacy and how I could lock down some articles to known entities, via OpenID or some such.

I have way too many dorky projects on the go.

