Wednesday, January 21, 2009

Ole and accessing files embedded in Access part #1

Well, it's been a while sins my last post. I've been really busy on building the software that makes this article possible and in my spare time I've also been really busy on a website, of which I'll tell you more when the time is ripe.

But down to business, today's post is the start of a short series on Ole and files in Microsoft Access. Unfortunately we're looking at technology of the stone age here and documentation on the subject is hard to find. Still we need a way to extract the files we had stored through Access, to be able to move away from this technology.

In this first article we'll look at some theory and the challenges will become apparent. The later articles will show the code for working with Ole in C#. But before we go on, I would like to express a big 'thank you' to Ernst Raedecker, who provided me with some insight on the actual headers. Also I would like to thank Eduardo Morcillo for providing me with a .NET library to handle structured storage.

Files in Microsoft Access
So let's get started. The first thing we need to understand is why this is so difficult. The Access team, back in the day, needed a way to not only store files in an Access database, but also to present it to the user, with other software installed on the users computer. Because back then, there was no reliable way to find out through the OS what software handled what file type, they had to come up with some other way. That way is OLE: Object Linking and Embedding. It was first released in 1990.

All is well, we just determine the length of the header and footer and chop 'm off, right? Wrong. Unfortunately someone, somewhere decided that it would be more efficient in storage to vary the length of a part of the header based on it's contents. Oh, and someone came up with structured storage, so your original file is actually embedded in a mini file system, which needs reading out... but not always.

Here is how a file is stored inside Access:
  • Package header
  • Ole header
  • Data block length
  • Data (which can be a structured storage, but it can also be the actual file)
  • Sometimes a metafilepict block
  • Ole footer

So as you can see, there is a lot of variation in Ole files. Fortunately they do have some things in common to help with the process of extraction, but if you are looking to get a better understanding, I would advice to install a Hex-editor.

The first thing to look at is both of the headers. The package header looks like this:

  • A Signature (short): this indicates that the file is a package
  • The header size (short)
  • An object type (uint): 0 = linked, 1= embedded, 2 = either (who came up with either?)
  • The length of the friendly name in the header (short)
  • The length of the class name in the header (short)
  • The offset of the friendly name (short)
  • The offset of the class name (short)
  • The size of the object (int)
  • The friendly name (string, variable length)
  • The class name (string, variable length)

So, as you can see, you need to actually interpret the package header to get the right length, to skip over the header, but also you can learn what is inside the package by reading some of the information it provides.

The Ole header has a lot less information inside it:

  • The Ole version (uint)
  • The Format (uint)
  • The object type name length (int)
  • The object type name (string, variable length)

The Ole header actually ends with 8 empty bytes followed by 4 bytes that make up the length of the datablock as an int.

As I explained earlier, inside the datablock can be the actual file, but also there can be a structured storage. To determine if this is a structured storage, there is actually an 8 byte signature at the start of the storage.

Inside the storage, there is a mini file system, of which you can find the specs here. If you look in to them, you'll understand why I chose to use Eduardo's library, instead of building my own :-). Say, "thank you, Eduardo".

But... you should not always dig into the structured storage. If, for example you stored a Microsoft Word document in Access, you need to leave the structured storage in tact, as it's part of the Word document. So at this part, the class name from the package header comes in handy.

If you do have to dig into the structured storage, what you're looking for is the stream inside the CONTENTS element. It is the binary stream that makes up the original file.

Now, if you're dealing with images, like I am, there is another catch, called the metafilepict block. You're not likely to find a CONTENTS element if you need to read the metafilepict block. The big question for me was, where is the metafilepict block. It turns out that it is at a position you can calculate like this:

metefilepict start position = total package header length + total ole header length + 8 (empty bytes) + 4 (data length bytes) + data length + 45 (metafilepict header length)

The stream at this position contains your actual image file.

So now you know how it all works. In the next post we'll dive into some code I wrote to handle all this. If you have any questions, comments or suggestions, please leave us all a comment.

11 comments:

  1. hi Jonathan,
    Great article and very helpfull.

    I have been trying to strip content from several binary data with OLE format using your solution.
    It works fine with Word documents (Word.Document.8) and PDF (AcroExch.Document.7)but not with any kind of images. Firstly all images in my cases are allways marked with label 'Package' (not MSPhotoEd.3). But anyway I forced them to skip the metafilepict header and I cannot find the useful content niether in CONTENT block nor in metafilepict block. Pardon, in the metafilepict there is a something like an icon, but not the image per se.

    I save the streams as image files and trying to open them from File System and is imposible to do it. They are bot recognized as images files.

    You said in your article:

    "...metefilepict start position = total package header length + total ole header length + 8 (empty bytes) + 4 (data length bytes) + data length + 45 (metafilepict header length)

    The stream at this position contains your actual image file....."

    So in this case what is the CONTENT block for? What information is stored there?

    Thank you for your help

    Regards
    Juan

    ReplyDelete
  2. Sorry it was a typo in my last comment:

    "...I save the streams as image files and trying to open them from File System and is imposible to do it. They are BOT recognized as images files...."

    Actually I meant:

    "I save the streams as image files and trying to open them from File System and is imposible to do it. They are NOT recognized as images files."

    ReplyDelete
  3. Hi Juan,

    Thanks for your comment. Have you read part #2 of this article? You can find it here.

    There is one thing I should clear up at this point. We used MS Photo Editor on the systems that saved the images in the OLE containers. This is why they are in there with the MSPhotoEd.3 type in the first place. Determining the type of your file when the header reads type package is a hard one to crack. It can be any file type and headers might be missing.

    What might be helpful is to save an image you have through the same mechanism that was used for the objects already there and then try working with that image. You can do compares with a Hex editor to see what's changed.

    HTH.
    Jonathan

    ReplyDelete
  4. Hi Jonathan,

    Thanks for your answer.
    Yes I had read part #2. It had been so usefull.
    I could extract the content from OLE File without problem. I think my issues are related with images format. The matter is that I dont know how to deal with metefilepict.

    Anyway, your advise has been useful. I am trying by this way.

    Thks again
    Juan

    ReplyDelete
  5. Hi there.

    These articles helped me to better understand some details of MS Access OLE Package structure - thanks for it.

    Recently I worked on a similar problem - we had to export some photos from MS SQL Server to MS Access.mdb according to the requirements of our business partner.

    In MS SQL Server photos are saved as binary images of the original .jpg files, in the mdb they are expected as OLE Objects.

    Fortunately, we managed to write a simple C# .NET 3.5 code for this and now we offer it here to a public use or possible improvements:

    http://www.ipex.sk/downloads/LibInterop.zip

    For any comment you are welcome at ssykora[at]ipex.sk.

    Regards

    Stanislav Sýkora

    ReplyDelete
  6. Hi Stanislav,

    Thanks for the heads up and great work on wrapping images back into OLE Objects.

    Greets,
    Jonathan

    ReplyDelete
  7. Hello,

    basically it has the simple and free .NET library to extract native data from Access OLE. You can check it here http://intasphere.ru/en/oleextract.php

    ReplyDelete
  8. Hi Sergey,

    Thanks for the heads up.

    Greets,
    Jonathan

    ReplyDelete
  9. I am accessing an Ms Access database through C#. I am able to read all the fields. The problem that I am getting is, while reading ".txt" and ".doc" files that are stored in OLE Object field of the table, a lot of extra junk characters are also getting read before and after the actual text.

    I found the solution for word documents (.doc files). For word documents, "Extra OLE Content Length" that gets added while retrieving through query is 85 bytes. So I strip 85 bytes from both ends of the byte array like I did in below sample code-
    ------------------
    Con.Open();
    string _query="select licenseDoc from Products where ID=56";
    //Column licenseDoc contains word and text douments as OLE Objects
    OleDbCommand Cmd = new OleDbCommand(_query, Con);

    const int offset =85;
    var oleBytes = (Byte[])Cmd.ExecuteScalar();
    MemoryStream ms = new MemoryStream();
    ms.Write(oleBytes, offset, oleBytes.Length - offset);

    var filePath = Path.GetTempFileName();
    using (var fileStream = File.OpenWrite(file))
    {
    var buffer = ms.GetBuffer();
    fileStream.Write(buffer, 0, (int)ms.Length);
    }
    ----------------
    The variable "filePath" will contain the path of the .tmp file, which contains the data read from from the word document stored as an OLE object in Ms Access. This file can be directly opened in as a word document or it's extension can be changed .doc.

    The "Extra OLE Content Length" for other formats are as follows:

    1] JPEG/JPG=224
    2] BMP=78
    3] PDF=85
    4] SNP=74
    5] DOC=85

    I don't know the "Extra OLE Content Length" of .txt(Simple Text) files stored in Access. Any idea what it could be?

    ReplyDelete
  10. Hi Sunil,

    First of all thanks for your extensive comment with some nice additions to the blog post.
    In order to find out the length of an ole header, you could simply use the class that is explained and downloadable in the second post of this series. You can find it here.

    Once you execute the ReadHeader method in that class, the _endOfHeaderPosition field contains the position after the header was written (which should be the number you're looking for).

    HTH.

    Jonathan

    ReplyDelete
    Replies
    1. Hi Jonathan,

      Thank you for the quick response.

      I used your code and _endOfHeaderPosition field gives me 85 for word(".doc") files. I use this 85 to strip OLE header and footer. So it works great with doc files.
      ----------------------------------------
      The _endOfHeaderPosition field gives me 68 for simple Text(".txt") files. But still even after stripping off content from both ends of the file, I get lot of junk characters at the bottom of the file and few junk characters at the beginning like-

      LICENSE-ANTLR-2 7 5.txt C:\WINDOWS\system32\SHELL32.dll E \\insidecontent1\ThirdPartyDatabase\THIRDP~1\ANTLR2~1.5\LICENS~1.TXT ®

      I have noticed that after "~1.TXT"(6 characters), there is always extra 5 characters before the start of actual file content. So I use this pattern to strip extra OLE content before the file content.

      At the bottom of the file content, the Extra OLE content starts with a pattern like-


      METAFILEPICT

      Even though I am not able to capture other junk characters, I get the index of "METAFILEPICT", and then strip of the extra OLE junk characters from the bottom.

      Is there a better way for doing it? Will I get into any problem by stripping of the OLE content as I am doing now?

      Delete