AUTHOR'S MAIN SITE  »  TABLE OF CONTENTS for Open Office database tutorials.
Delicious.Com Bookmark this on Delicious    StumbleUpon.Com Recommend to StumbleUpon

Open Office Database Case Study
Designing a primary key to distinguish books in a collection

You may find that the database being shipped with OpenOffice (ver.2 and higher) delights you as much as it has me. This page tries to help you use it.

Forget anything you may have heard about Adabas, which came with Star Office, the commercial version of Open Office 1. The current Open Office's database, "Base", aka "ooBase", is unrelated. And remember that Open Office, including ooBase, is free! But don't let that fool you. And it's not new. Big organizations, government and civilian, are adopting it as their standard office suite... and saving million$, but still Getting The Job Done.

There's more about ooBase in the main index to this material.

This page is "browser friendly". Make your browser window as wide as you want it. The text will flow nicely for you. It is easier to read in a narrow window. With most browsers, pressing plus, minus or zero while the control key (ctrl) is held down will change the texts size. (Enlarge, reduce, restore to default, respectively.) (This is more fully explained, and there's another tip, at my Power Browsing page.)

Page contents © TK Boyd, Sheepdog Software ®, 9/12.



An example of a primary key... and matters arising along the way.

I have recently begun collecting rare books... but the following example covers ground which might apply to keeping track of many, many classes of objects.

For various reasons, not least so that I can build a well constructed database, I wanted a short, unique identifier for each of the volumes in my collection. If you read through this essay, it may help you design the rules for an identifier you need for some task you are undertaking.

In the beginning, there were serial numbers...

... even before there were databases!

I could just go through my collection, and number the first book I come to "001", the second "002", and so on.

That would meet the basic needs of the database, a unique identifier for each item covered... but it would be a very crude answer. We can do so much better. In this essay, I will try to show you how an answer could be better, and how to achieve such an answer.

By the way, if you like the scheme I describe, I have a Windows computer program you are welcome to use which generates identifiers as described. At them moment, to obtain it, you download a .zip file which not only has the program (CD93.exe), but also all of the files you would need if you wanted to use Lazarus to make changes to it... But you can download the .zip, extract just the .exe file, and use it, if you like. There's a screenshot and more details in the page explaining the programming. Don't be overwhelmed by the screenshot... very "busy"... but using the application is trivial.

Why a short identifier

First, just a quick note as to why I want a short, unique identifier for each book.

I need to keep track of various things associated with each book. When did I buy it? What did I pay? Where is the book? The first two can be managed by keeping the invoices for my books. The latter can be handled with a simple list.

But how to reliable link this invoice with that book?

Simple! Give each book a short unique ID, even if it is merely "001", "002"..., and mark (carefully!) the ID in the book and on the invoice, and then file the invoices in the order of the IDs. And, once the book has a unique ID, the "Where" list can include the unique ID, so that if you have more than one copy of, say, Thornton's Flora, you can tell which copy is in each of the two locations for which you have a copy recorded.

That's just a quick introduction to why having unique identifiers is useful, in general. If you want to use a database in more than a stumbling manner, you will have to have unique identifiers for each object recorded in the database, so that you can have "primary keys" for your tables. That's covered in more detail in essays I've written about using databases. Let's return to the current example of designing a unique identifier.

Why not just serial numbers?

What's so terrible about just using "001", "002", "003", etc?

The most glaring flaw is that if you have one book called "758" and another called "785", you can quite easily write one when you mean the other.

Long ago, someone invented the idea of a "checksum". If I add a "checksum" to each of the IDs mentioned above, they might become "758w" and "785f". "So?", you ask?

Those letters were not chosen at random. The "w" is what you get if you (or, more likely, you have your computer) apply a rule. If the same rule is given 785, then the rule generates an "f".

If you make a mistake when trying to write "758w" and write "785w", then any time that ID is presented to a computer which is checking, the computer will say "Something is wrong. You don't get a "w" from "785". So any bits of lysdexia can be caught.... if you use checksums!

Mitigating the weaknesses in serial numbers

Okay... so our unique identifier is going to have a checksum. (We'll go into the "rule" used later.

Let's see what we can do about two more shortcomings of serial numbers.

As your collection grows, you have to keep track of what the last serial you used was to know what the next one should be. This isn't always a problem, but it can be a nuisance.

In addition, a simple serial number, on its own, tells you very little.

I tend to create unique identifiers as follows, which partly answers both of those objections to simple serial numbers.

I'm writing the first draft of this on September 28, 2012. If I added two books to my collection today, the first might be given the ID "120928a", and the second "120928b".

Even with such a simple system, my unique identifier already tells me a bit about the book directly from the ID... the ID tells me when I put the book in the database.

.... and the system has two flaws....

Happily, I can live with both of those shortcomings, for this project! But you need to recognize them, and need to ask if either matters to the work you are doing... or, if not, is there a "work around" which will fix things for you?

As I am going to give each volume a separate ID, I suppose the day may come when I buy a 30 volume encyclopedia, and need 30 unique identifiers. What will I do? Assuming I haven't bought any books the previous day, I will log the first 26 volumes in as having been bought then, and take care of the other four as if bought on the day I'm doing the work.

In fact, I intend to assign IDs to my books in not by the date they were entered into my database but rather by the date I acquired the book. In some cases, not many, this will have been before 2001. I was going to reserve all of the IDs for 2000 to use with books acquired before 2001... but my nerve failed me, and I decided to encode all of the year into the unique identifier.

You could, of course, simply use "20120928" as the starting point for a book ID based on 2012, September, 28th day. And there will be times when that is exactly what you should do. But, other times, the simpler answer ("120928") would be fine.

It might seem that a lot of trouble is being taken to save just two characters in the unique identifier. Yes. And that trouble will be repaid.

Before we go on, note that for September we used "09", not just "9". If the day had been, say, the 5th, we would have used "05", not "5". Why? What date would "2012123" be? 23nd of January? 3rd of December? Anyway. It Just Is Better if the number of characters used for any of the parts of the unique identifier remains the same across all of the various IDs assigned.

Starting with "20120928" an ID based on 2012, September, 28th, we're going to do two things to shorten it.

In place of the "20" to indicate the century, we're going to use a letter. "a" for 18, "b" for 19, "c" for 20, etc. (Hence, there will usually be a "c" here.)

(P.S.: This part will actually usually be an "f".. this is due to a change from defining "a" as 18 to defining it as "15"... a change in the early stages of the programming, but after the bulk of the documentation was done. At this point, the rest of this document reflects the old plan... "c"=20)

And, more radically... but you soon get used to it... we're going to shrink the month indicator to just one character. "1" for January, "2" for February, etc... but what about October, November, December? We'll use "a", "b" and "c" for those months.

Just to relieve your mind, and stake a claim to not being completely obsessive about shortening ID strings, I will point out that the day... for which I propose using two characters... could have been encoded with "a", "b", "c", etc. for the first, the second, the third, etc days of the month. But remember that the first must be shown as "01", not just "1"!

A little thing... I recommend using lower case letters most of the time, e.g. use "a", not "A". Not only are lower case letters easier to type, especially if you are holding a book open with one hand while you try to enter data with the other, but they are also more readable. A "P" and a "B" are not so very different, visually. On the other hand, the shapes of a "p" and a "b" have differences lacking in the upper case alternatives.

So... the first book acquired on the 3rd of November, 2012, is going to be given the following ID: "c12b03a".

What if we don't know the date of acquisition exactly? As it happens, in this case, I am not going to use the following system, here... but you might want to use it for the work you are doing. If you know the year and month, but not the day, you could put "99" in as the day. If you knew the year, but not even the month, you could put "z" for the month, "99" for the day. There are usually ways to do things... but they always come with a price, and if you aren't careful, you only discover the price when you've already assigned more unique identifiers than you want to go back and revise. Remember... it won't be just the books that need the IDs in them changes, but also the invoices associated with the book, etc, etc.

What I am going to do about books for which I do not know the acquisition date is the following....

For recently acquired books, for which I know approximately when I acquired them, I will make up a "stand in" invoice, with a date of my guessing, but on the invoice I will make a note that the invoice is "a fake", and that the date is a guess.

For books acquired long ago... I have one that was given to me many years ago by my grandmother, who herself had had it for many years. I'd like to use the date she acquired it, but that datum is lost. For such books, I will use the date I assigned the unique identifier... but I will use "z" as the letter for the first such done on a given day, a "y" for the next, a "x" for the next, and so on. Again, a "fake invoice" will be prepared for the book, and what I know of its history can be recorded there. (The database will also have a table for machine readable notes about individual books, too, of course.)

Use the scheme, Luke....

So now we have a scheme for our unique identifiers....

cyymdds

The "cyymdd" part normally encodes the date a book was acquired. The last character in the unique identifier is a serial... "a" will be used when processing the first book of a given day's acquisitions, "b" for the second, and so on... with "z", "y", "x", "w", and "v" (only) reserved for books for which the date of acquisition isn't known, in which case the cyymdd part will be derived from the day the book was given its unique identifier.

The checksum

In addition to all of the above, we are going to add one more character, the checksum.

It will be a letter, but this time it may be upper case or lower case. Why?

The "letter" will actually be standing for a number. It might be that "a" stood for 1, "b" for 2, .... "z" for 26, "A" for 27, "B" for 28, and so on... but the detail doesn't matter. The reason why we will endure the hassle of a case sensitive datum here is that it reduces the chance of the checksum failing to catch an error on our part.

I'll spare you the detail of why having codes for more than 1-26 makes the checksum better, unless enough people write to ask why this is.

And, while we're being tricky... and all of this will be taken care of by the computer anyway... we will not use all of the letters available to us. We won't use, for instance, o or O, l or I. They are too easily confused with other characters. To you and me, whether something is an "oh" (O) or a zero (0) doesn't matter. But if you have the wrong one in the checksum's place, then what you've entered will be unacceptable. (Substitutions in other parts of the unique identifier string will also cause the checksum to say "error", but I think it reasonable to hope that users will use the right characters in other places. And if they don't, they will have to learn to!)

Lastly, a little twist

So far so good?

We're going to revise one part of our rules for creating unique identifiers for the books in my collection.

So far, you might think that we've defined the unique identifier as being made up as follows...

cyymddse

... where "e" stands for the error checking checksum. (Couldn't use "c", could I? Already used to mark the first character in the string, the one that tells us the century.)

I haven't, in fact, said that the checksum character should go onto the right hand end.

In fact, it should be pre-pended to all of the characters we discussed before we got to discussing the checksum character. The unique identifier format should in fact be....

ecyymdds

The reason is as follows...

It will help the database software which these IDs are going to be used in if differences between the IDs are common in the early (left hand) parts of the strings. In a "perfect, for the database, world", we'd code the day as a letter, as discussed before, and make the string "edmyycs", but that would make human use of the strings tedious. Having the first character highly variable will be a big help to the database, at little cost to the human. After that, the database's wants can defer to the human's convenience.

Which may leave you with a question:

"Why do the date YY-M-DD? Why not the DD-M-YY we are more familiar with, or, M-DD-YY?"

Here's the answer to that one... and it pertains to many places where you are creating names from dates.

If you use yymdd, then if you have a bunch of records (or files, or anything else) with yymdd IDs or names, then a simple alphabetical sort, even outside of a database, of the IDs or names also gives you a chronological sort.

Don't forget,, if you like the scheme I describe, I have a (free) Windows computer program you are welcome to use which generates identifiers as described. At them moment, to obtain it, you download a .zip file which not only has the program (CD93.exe), but also all of the files you would need if you wanted to use Lazarus to make changes to it... But you can download the .zip, extract just the .exe file, and use it, if you like. There's a screenshot and more details in the page explaining the programming. Don't be overwhelmed by the screenshot... very "busy"... but using the application is trivial.



Editorial Philosophy

I dislike 'fancy' websites with more concern for a flashy appearance than for good content. For a pretty picture, I can go to an art gallery. Of course, an attractive site WITH content deserves praise... as long as that pretty face doesn't cost download time. In any case....

I am trying to present this material in a format which makes it easy for you to USE it. There are two aspects to that: The way it is split up, and the way it is posted. See the main index to this material for more information about the way it is split up, and the way it is posted.


Ad from page's editor: Yes.. I do enjoy compiling these things for you... I hope they are helpful. However.. this doesn't pay my bills!!! If you find this stuff useful, (and you run an MS-DOS or Windows PC) please visit my freeware and shareware page, download something, and circulate it for me? Links on your page to this page would also be appreciated!

PLEASE >>> Click here to visit editor's Sheepdog Software (tm) freeware, shareware pages <<< PLEASE


If you liked this ooBase tutorial, see the main index for information other help from the same author.

Editor's email address. Suggestions welcomed!     - - -    Want a site hosted, or email? I like 1&1's services.




Valid HTML 4.01 Transitional Page tested for compliance with INDUSTRY (not MS-only) standards, using the free, publicly accessible validator at validator.w3.org. Mostly passes. There were two "unknown attributes" in Google+ button code. Sigh.


One last bit of advice: Be sure you know all you need to about spyware.

. . . . . P a g e . . . E n d s . . . . .