Welcome to this evening/morning’s quiz, folks. Here is the one and only question: what happens when you don’t make a good database/data storage design for a project up front? A: The last 4 hours of my life.
As it turns out, there is a bit of strangeness in the wtry land. See, part of the requirements was that files (including logs) had to be stored as files in the filesystem, not entries in a database somewhere. This wasn’t too big of a deal, but what I decided to do with that requirement is where I got in trouble.
As a quick little sidebar, wtry is a submission system for the CS department at RIT. It allows students to submit files/code and have those files saved and optionally compiled/run. It also has some minimal support for grading. On to the fun.
So, there were 8 tables in the database (before today): courses, assignments, tasks, graders, courseregs, submissions, grades, and extensions. There are some things that aren’t in the database at all, like files that are associated with a task (files provided for student download, solution files provided to graders for grading, and “provided” files that are provided to the compilation/run step). So, without thinking about it too much, I didn’t store any information about the logs in the database, thinking it would be the same as these other files. The problem? Those other files belong to a single task, while log files are associated with a combination of course and task (and, per log, a student).
So log files were stored in directories something like this:
LOGDIRECTORY/courseid/assignmentid/studentid/taskid.log
(Yeah, yeah, the assignment level was extra. I actually really don’t remember why I put it there, since tasks belong to exactly one assignment.)
So here’s the problem. I need to delete the log whenever either:
- I delete the course
- I delete the task (either by deleting the task directly or deleting the assignment, which deletes all of its tasks)
Again, as a quick little sidebar: for those of you who are wondering, courses and assignments have an n:n relationship. An assignment can be for multiple courses (e.g. all lab sections of CS1), and a course can have multiple assignments, none of which are required to relate to each other. This is just in case you might wonder why deleting a course didn’t delete all of its assignments.
Well, this becomes a bit of a problem when I need to, say, delete a task. First off, there is no list of where its logs live, so in order to delete all logs for a given task, I would have to enumerate through the entire log directory. Now, wtry has only been in use for about a month, and has only been used in one course, with around 30-50 students. There are already 230 logs (I only know because they now have database entries). Let’s say wtry was being used by, oh, 6 different courses. That would be somewhere around 200-400 students (depending on the course/how many sections it has). In the space of a quarter, there are usually 10 assignments, and each assignment probably has (at a low estimate) 3 tasks with submissions (that get logged). That means 30 logs per student, or somewhere between 6,000 and 12,000 log files, and that is with the low estimate for tasks. Enumerating through 12,000 files is not my idea of a party.
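To make the pain concrete, here is a rough sketch (in Python, not wtry’s actual code) of what “find every log for one task” looks like under the old layout. The function names and the idea of globbing are my own assumptions for illustration; the only things taken from above are the directory layout and the numbers in the estimate.

```python
from pathlib import Path


def find_task_logs_old_layout(log_root, task_id):
    """Return every log for task_id under the old layout.

    Old layout: LOGDIRECTORY/courseid/assignmentid/studentid/taskid.log
    With no index anywhere, every student directory has to be visited.
    (Hypothetical helper; assumes ids contain no glob characters.)
    """
    root = Path(log_root)
    # One wildcard per directory level: course, assignment, student.
    return sorted(root.glob(f"*/*/*/{task_id}.log"))


def estimate_log_count(students, assignments, tasks_per_assignment):
    """Back-of-the-envelope count: one log per student per task."""
    return students * assignments * tasks_per_assignment


# Upper end of the student range, low estimate for tasks.
print(estimate_log_count(400, 10, 3))
```

The glob has to walk the whole tree no matter how few logs actually match, which is exactly the part that gets worse as more courses start using the system.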
Enter the refactoring. First, add a table to the database, `logs`, that has `studentid`, `courseid`, and `taskid` (which, together, form a UNIQUE or PRIMARY key). Then, change the Log class to INSERT that data into the database whenever it creates a new log. Then, create a nice little script that enumerates through all the existing logs and INSERTs their data into the database. Finally, since the directory structure sucked to begin with, we’re going to change it to:
LOGDIRECTORY/courseid/taskid/studentid.log
This has two gains over the old way: it gets rid of assignmentid, and it keeps deleting a task’s worth of logs as easy as removing a directory (which may also be more efficient). So, the “finally” is to back up the log directory, write a script that copies the logs into the new directory structure, and make the new structure the active log directory. Oh, and “finally-finally”, make sure the whole thing works. Yeah, that is pretty important as well.
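For the curious, the refactoring could be sketched roughly like this. It is in Python with SQLite purely as a stand-in: the post doesn’t depend on a particular database, and none of this is wtry’s real code. The `logs` table and its three columns come from the description above; the function names, the TEXT column types, and the `INSERT OR IGNORE` (SQLite-specific syntax) are assumptions of mine.

```python
import shutil
import sqlite3
from pathlib import Path


def create_logs_table(conn):
    """Create the logs table; the three ids together form the primary key."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS logs (
            studentid TEXT NOT NULL,
            courseid  TEXT NOT NULL,
            taskid    TEXT NOT NULL,
            PRIMARY KEY (studentid, courseid, taskid)
        )
        """
    )


def migrate_logs(conn, old_root, new_root):
    """Copy logs from the old layout to the new one and index each in the table.

    Old: old_root/courseid/assignmentid/studentid/taskid.log
    New: new_root/courseid/taskid/studentid.log
    Copies rather than moves, so the old tree survives as a backup.
    """
    old_root, new_root = Path(old_root), Path(new_root)
    copied = 0
    for old_path in sorted(old_root.glob("*/*/*/*.log")):
        courseid, _assignmentid, studentid = old_path.relative_to(old_root).parts[:3]
        taskid = old_path.stem
        new_path = new_root / courseid / taskid / f"{studentid}.log"
        new_path.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(old_path, new_path)
        # OR IGNORE makes the script safe to re-run (SQLite syntax).
        conn.execute(
            "INSERT OR IGNORE INTO logs (studentid, courseid, taskid) VALUES (?, ?, ?)",
            (studentid, courseid, taskid),
        )
        copied += 1
    conn.commit()
    return copied


def delete_task_logs(conn, new_root, taskid):
    """Delete all logs for a task: one directory removal per course that has it."""
    new_root = Path(new_root)
    rows = conn.execute(
        "SELECT DISTINCT courseid FROM logs WHERE taskid = ?", (taskid,)
    ).fetchall()
    for (courseid,) in rows:
        shutil.rmtree(new_root / courseid / taskid, ignore_errors=True)
    conn.execute("DELETE FROM logs WHERE taskid = ?", (taskid,))
    conn.commit()
    return len(rows)
```

The point of the table is visible in `delete_task_logs`: the database says which courses have logs for the task, so the filesystem work is one directory removal per course instead of a walk over everything.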
I need to go to sleep. I’m sure you will all sleep better tonight knowing that it is up and running on the production server in a minimal sort of way: I didn’t change the function prototypes on the production server for log-stuff to get rid of $assignmentid, or on pages that used logs requiring a $_GET variable for $assignmentid. However, all of that has been changed in the CVS repository, to be pushed to the production server after some exciting testing and further development.
Have a good weekend, kids. And remember: don’t be dumb like Noah. Be smart. Very smart.