But down to business: today's post is the start of a short series on OLE and files in Microsoft Access. Unfortunately, we're looking at stone-age technology here, and documentation on the subject is hard to find. Still, we need a way to extract the files we stored through Access so that we can move away from this technology.
In this first article we'll look at some theory, and the challenges will become apparent. The later articles will show the code for working with OLE in C#. But before we go on, I would like to express a big 'thank you' to Ernst Raedecker, who provided me with some insight into the actual headers. I would also like to thank Eduardo Morcillo for providing me with a .NET library to handle structured storage.
Files in Microsoft Access
So let's get started. The first thing we need to understand is why this is so difficult. The Access team, back in the day, needed a way not only to store files in an Access database, but also to present them to the user with other software installed on the user's computer. Because there was no reliable way back then to find out through the OS which software handled which file type, they had to come up with some other way. That way is OLE: Object Linking and Embedding. It was first released in 1990.
OLE wraps each stored file in a header and a footer. So all is well, we just determine the length of the header and footer and chop 'em off, right? Wrong. Unfortunately someone, somewhere decided that it would be more storage-efficient to vary the length of part of the header based on its contents. Oh, and someone came up with structured storage, so your original file is actually embedded in a mini file system, which needs reading out... but not always.
Here is how a file is stored inside Access:
- Package header
- OLE header
- Data block length
- Data (which can be a structured storage, but it can also be the actual file)
- Sometimes a metafilepict block
- OLE footer
So as you can see, there is a lot of variation in OLE files. Fortunately, they do have some things in common that help with extraction, but if you want a better understanding, I would advise installing a hex editor.
The first thing to look at is both of the headers. The package header looks like this:
- A Signature (short): this indicates that the file is a package
- The header size (short)
- An object type (uint): 0 = linked, 1 = embedded, 2 = either (who came up with either?)
- The length of the friendly name in the header (short)
- The length of the class name in the header (short)
- The offset of the friendly name (short)
- The offset of the class name (short)
- The size of the object (int)
- The friendly name (string, variable length)
- The class name (string, variable length)
So, as you can see, you need to actually interpret the package header to find its length and skip over it. You can also learn what is inside the package by reading some of the information it provides.
The OLE header has a lot less information inside it:
- The OLE version (uint)
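To make the field list above a bit more tangible, here is a minimal Python sketch of reading the package header. It is only an illustration, not the C# code from the later posts. It assumes little-endian byte order, 2-byte shorts and 4-byte ints, name offsets counted from the start of the package header, and ASCII names that may be null-terminated. The names `PackageHeader` and `parse_package_header` are made up for this example.

```python
import struct
from typing import NamedTuple

# Fixed-size part of the package header, in the order of the field list.
# Assumption: little-endian, short = 2 bytes, uint/int = 4 bytes (20 bytes total).
_FIXED = struct.Struct("<HHIHHHHi")


class PackageHeader(NamedTuple):
    signature: int
    header_size: int
    object_type: int      # 0 = linked, 1 = embedded, 2 = either
    friendly_name: str
    class_name: str
    object_size: int


def _read_name(blob: bytes, offset: int, length: int) -> str:
    # Assumption: names are ASCII and may carry a trailing null byte.
    raw = blob[offset:offset + length]
    return raw.split(b"\x00", 1)[0].decode("ascii", errors="replace")


def parse_package_header(blob: bytes) -> PackageHeader:
    """Interpret the package header at the start of an Access OLE field."""
    (signature, header_size, object_type,
     name_len, class_len, name_off, class_off,
     object_size) = _FIXED.unpack_from(blob, 0)
    return PackageHeader(
        signature=signature,
        header_size=header_size,
        object_type=object_type,
        friendly_name=_read_name(blob, name_off, name_len),
        class_name=_read_name(blob, class_off, class_len),
        object_size=object_size,
    )
```

With the header parsed, `header_size` tells you how many bytes to skip, and `class_name` tells you what kind of object you are dealing with.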
- The Format (uint)
- The object type name length (int)
- The object type name (string, variable length)
The OLE header actually ends with 8 empty bytes, followed by 4 bytes that hold the length of the data block as an int.
As I explained earlier, the data block can contain the actual file, but it can also contain a structured storage. To determine whether it is a structured storage, check for the 8-byte signature at the start of the storage.
Inside the storage, there is a mini file system, of which you can find the specs here. If you look into them, you'll understand why I chose to use Eduardo's library instead of building my own :-). Say, "thank you, Eduardo".
But... you should not always dig into the structured storage. If, for example, you stored a Microsoft Word document in Access, you need to leave the structured storage intact, as it's part of the Word document. So at this point, the class name from the package header comes in handy.
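As a rough illustration of the OLE header layout just described, here is a small Python sketch (again, not the C# code from the later posts). It assumes little-endian 4-byte fields, that the OLE header starts right after the package header, and that the type name is ASCII and possibly null-terminated. The function name `parse_ole_header` and the `start` parameter are hypothetical.

```python
import struct


def parse_ole_header(blob: bytes, start: int):
    """Read the OLE header that begins at `start` (the package header length).

    Returns (version, format_id, type_name, data_length, data_start).
    """
    # Version (uint), format (uint), type name length (int): 12 bytes.
    version, format_id, name_len = struct.unpack_from("<IIi", blob, start)

    name_start = start + 12
    raw_name = blob[name_start:name_start + name_len]
    # Assumption: ASCII name that may end in a null byte.
    type_name = raw_name.split(b"\x00", 1)[0].decode("ascii", errors="replace")

    pos = name_start + name_len
    pos += 8  # skip the 8 empty bytes that close the header

    # The next 4 bytes hold the length of the data block.
    (data_length,) = struct.unpack_from("<i", blob, pos)
    data_start = pos + 4
    return version, format_id, type_name, data_length, data_start
```

The data block itself is then `blob[data_start:data_start + data_length]`.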
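The signature check is easy to sketch in Python. The 8 bytes below are the header signature from Microsoft's Compound File Binary format, which is the on-disk form of structured storage. The function name `is_structured_storage` is just a name picked for this example.

```python
# Header signature of a compound file (structured storage).
CFB_SIGNATURE = bytes.fromhex("D0CF11E0A1B11AE1")


def is_structured_storage(data: bytes) -> bool:
    """Return True if the data block starts with the structured storage signature."""
    return data[:8] == CFB_SIGNATURE
```

If this returns `False`, the data block is most likely the stored file itself rather than a mini file system.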
If you do have to dig into the structured storage, what you're looking for is the stream inside the CONTENTS element. It is the binary stream that makes up the original file.
Now, if you're dealing with images, like I am, there is another catch, called the metafilepict block. You're not likely to find a CONTENTS element if you need to read the metafilepict block. The big question for me was: where is the metafilepict block? It turns out that it is at a position you can calculate like this:
metafilepict start position = total package header length + total OLE header length + 8 (empty bytes) + 4 (data length bytes) + data length + 45 (metafilepict header length)
The stream at this position contains your actual image file.
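The position formula above translates directly into a few lines of Python. This is only a sketch of the arithmetic. The function and parameter names are made up, and `ole_header_length` is taken to mean the OLE header without the 8 empty bytes and the 4 length bytes, since the formula adds those separately.

```python
EMPTY_BYTES = 8                  # empty bytes closing the OLE header
DATA_LENGTH_BYTES = 4            # int holding the data block length
METAFILEPICT_HEADER_LENGTH = 45  # value taken from the formula above


def metafilepict_start(package_header_length: int,
                       ole_header_length: int,
                       data_length: int) -> int:
    """Offset at which the image stream of the metafilepict block begins."""
    return (package_header_length
            + ole_header_length
            + EMPTY_BYTES
            + DATA_LENGTH_BYTES
            + data_length
            + METAFILEPICT_HEADER_LENGTH)


# Example: 47-byte package header, 19-byte OLE header, 100-byte data block.
print(metafilepict_start(47, 19, 100))  # prints 223
```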
So now you know how it all works. In the next post we'll dive into some code I wrote to handle all this. If you have any questions, comments or suggestions, please leave us all a comment.