Developers 42: Ole

This series seems to be plagued by delays. Unfortunately I don't have much time lately. Work is really busy and I'm moving to a new home! So please be patient with me. I'll try to post as often as I can.

In the last post in this series I've talked about some of the theoretical background of opening files that were stored by Access. We've looked at the different headers and at metafilepict blocks. Today it's time to look at some code.

Before we do dive into the code, I would like to point out that part of this code is the library written by Eduardo Morcillo. My code wouldn't run without it.

First order of business is to have structs that can hold my header information:


internal struct PackageHeader
{
public short Signature;
public short HeaderSize;
public uint ObjectType;
public short FriendlyNameLen;
public short ClassNameLen;
public short FrNameOffset;
public short ClassNameOffset;
public int ObjectSize;
public string FriendlyName;
public string ClassName;
}

internal struct OleHeader
{
public uint OleVersion;
public uint Format;
public int ObjectTypeNameLen;
public string ObjectTypeName;
}

As you can see we have two structs that contain information about the package header and the OLE header, based on the information we gathered in the last article. The types are based on the number of bytes each entry can store.

The next order of business is to define some constant values we need during the process.


private const int FixedPackageHeaderSize = 20;
private const int FixedOleHeaderSize = 12;
private const int MetaFileHeaderSize = 45;
private const int BufferSize = 1024;
private const string ContentsEntryName = "CONTENTS";
private const string WorkBookEntryName = "Workbook";
private const string MSPhotoFriendlyName = "MSPhotoEd.3\0";

As you can see some number are here for fixed sizes of headers, a buffer size which is arbitrary, and some string constants we need to identify what type of data we are dealing with.
Finally we also need some private fields in our class that can hold some data for us:


private System.IO.Stream _input;
private long _endOfHeaderPosition;
private int _dataLength;
private PackageHeader _packageHeader;
private OleHeader _oleHeader;

The constructor of our class actually has a parameter of type Stream that is the input of the class. From the constructor a method ReadHeader is called:


private void ReadHeader()
{
if (_input.Position > 0 && _input.CanSeek)
{
_input.Seek(0, SeekOrigin.Begin);
}

byte[] fixedPackageHeaderData = new byte[FixedPackageHeaderSize];
_input.Read(fixedPackageHeaderData, 0, FixedPackageHeaderSize);

PackageHeader packageHeader = new PackageHeader();
packageHeader.Signature = CalcShortFromBytes(new byte[] { fixedPackageHeaderData[0], fixedPackageHeaderData[1] });
packageHeader.HeaderSize = CalcShortFromBytes(new byte[] { fixedPackageHeaderData[2], fixedPackageHeaderData[3] });
packageHeader.ObjectType = CalcUIntFromBytes(new byte[] { fixedPackageHeaderData[4], fixedPackageHeaderData[5], fixedPackageHeaderData[6], fixedPackageHeaderData[7] });
packageHeader.FriendlyNameLen = CalcShortFromBytes(new byte[] { fixedPackageHeaderData[8], fixedPackageHeaderData[9] });
packageHeader.ClassNameLen = CalcShortFromBytes(new byte[] { fixedPackageHeaderData[10], fixedPackageHeaderData[11] });
packageHeader.FrNameOffset = CalcShortFromBytes(new byte[] { fixedPackageHeaderData[12], fixedPackageHeaderData[13] });
packageHeader.ClassNameOffset = CalcShortFromBytes(new byte[] { fixedPackageHeaderData[14], fixedPackageHeaderData[15] });
packageHeader.ObjectSize = CalcIntFromBytes(new byte[] { fixedPackageHeaderData[16], fixedPackageHeaderData[17], fixedPackageHeaderData[18], fixedPackageHeaderData[19] });

byte[] friendlyNameData = new byte[packageHeader.FriendlyNameLen];
_input.Read(friendlyNameData, 0, packageHeader.FriendlyNameLen);
packageHeader.FriendlyName = Encoding.UTF8.GetString(friendlyNameData);

byte[] classNameData = new byte[packageHeader.ClassNameLen];
_input.Read(classNameData, 0, packageHeader.ClassNameLen);
packageHeader.ClassName = Encoding.UTF8.GetString(classNameData);

_packageHeader = packageHeader;

byte[] fixedOleHeaderData = new byte[FixedOleHeaderSize];
_input.Read(fixedOleHeaderData, 0, FixedOleHeaderSize);

OleHeader oleHeader = new OleHeader();
oleHeader.OleVersion = CalcUIntFromBytes(new byte[] { fixedOleHeaderData[0], fixedOleHeaderData[1], fixedOleHeaderData[2], fixedOleHeaderData[3] });
oleHeader.Format = CalcUIntFromBytes(new byte[] { fixedOleHeaderData[4], fixedOleHeaderData[5], fixedOleHeaderData[6], fixedOleHeaderData[7] });
oleHeader.ObjectTypeNameLen = CalcIntFromBytes(new byte[] { fixedOleHeaderData[8], fixedOleHeaderData[9], fixedOleHeaderData[10], fixedOleHeaderData[11] });

byte[] objectTypeNameData = new byte[oleHeader.ObjectTypeNameLen];
_input.Read(objectTypeNameData, 0, oleHeader.ObjectTypeNameLen);
oleHeader.ObjectTypeName = Encoding.UTF8.GetString(objectTypeNameData);

_oleHeader = oleHeader;

for (int index = 0; index < 8; index++)
{
_input.ReadByte();
}

byte[] lengthData = new byte[4];
_input.Read(lengthData, 0, 4);
_dataLength = BitConverter.ToInt32(lengthData, 0);

_endOfHeaderPosition = _input.Position;
}

This method reads the header and decomposes it into entries. This allows us to get to the variable bits and read them correctly as well. It also gives us some important information, being the length of the data block and the end position of the header.

So on to the business end of this class, it's GetStrippedStream method:


public System.IO.Stream GetStrippedStream()
{
if (_input.Position != _endOfHeaderPosition && _input.CanSeek)
{
_input.Seek(_endOfHeaderPosition, SeekOrigin.Begin);
}
if (_packageHeader.ClassName.Equals(MSPhotoFriendlyName, StringComparison.OrdinalIgnoreCase))
{
_input.Seek(_dataLength + MetaFileHeaderSize, SeekOrigin.Current);
}

string tempFileName = Path.GetTempFileName();
FileStream tempFileStream = File.OpenWrite(tempFileName);

byte[] buffer = new byte[BufferSize];
int loadedBytes = _input.Read(buffer, 0, BufferSize);
while (loadedBytes > 0)
{
tempFileStream.Write(buffer, 0, loadedBytes);
loadedBytes = _input.Read(buffer, 0, BufferSize);
}
tempFileStream.Close();

System.IO.Stream outputStream;
bool isCompoundFile = Storage.IsCompoundStorageFile(tempFileName);
if (isCompoundFile)
{
Storage storage = new Storage(tempFileName);
Storage.StorageElementsCollection elements = storage.Elements();
// element.Name.Equals(WorkBookEntryName, StringComparison.OrdinalIgnoreCase)
var result = from StatStg element in elements
where (element.Name.Equals(ContentsEntryName, StringComparison.OrdinalIgnoreCase)
element.Name.Equals(WorkBookEntryName, StringComparison.OrdinalIgnoreCase))
&& element.Type == StatStg.ElementType.Stream
select element;
if (result.Any())
{
outputStream = storage.OpenStream(result.First().Name);
}
else
{
storage.Close();
outputStream = File.OpenRead(tempFileName);
}
}
else
{
outputStream = File.OpenRead(tempFileName);
}
return outputStream;
}

As you can see we first make sure we set the position of the stream to the right position, based on the header information. The then write the stream to a temp file. The temp file is used to work with Eduardo's library. We use it to determine if this file is actually a structured storage and if so we extract only the stream we need. If this is a Microsoft Word document, then it will not have the elements we look for in the structured storage. In this case we want to send the complete file as a result. If this was not a structured storage, then we want to send the complete file as well.

Warning:
The code may suggest this works for Microsoft Excel as well, but unfortunately it doesn't. The reason for this is that whenever Access embeds an Excel file, it will change the structured storage completely, until the point that the original file can not be recovered. I have managed to get the data out of it, but I've not managed to get a properly working Excel file. If anyone can provide me with some more insight into this I would be very greatful.

Below you can find the .cs file with the complete class.

This concludes the series on Ole. I hope you have found it helpful. Please leave any comments and/or questions below. I'm always happy to read and reply.

Update April 5th 2012: Reembedded the download as it was broken.

Well, it's been a while sins my last post. I've been really busy on building the software that makes this article possible and in my spare time I've also been really busy on a website, of which I'll tell you more when the time is ripe.

But down to business, today's post is the start of a short series on Ole and files in Microsoft Access. Unfortunately we're looking at technology of the stone age here and documentation on the subject is hard to find. Still we need a way to extract the files we had stored through Access, to be able to move away from this technology.

In this first article we'll look at some theory and the challenges will become apparent. The later articles will show the code for working with Ole in C#. But before we go on, I would like to express a big 'thank you' to Ernst Raedecker, who provided me with some insight on the actual headers. Also I would like to thank Eduardo Morcillo for providing me with a .NET library to handle structured storage.

Files in Microsoft Access
So let's get started. The first thing we need to understand is why this is so difficult. The Access team, back in the day, needed a way to not only store files in an Access database, but also to present it to the user, with other software installed on the users computer. Because back then, there was no reliable way to find out through the OS what software handled what file type, they had to come up with some other way. That way is OLE: Object Linking and Embedding. It was first released in 1990.

All is well, we just determine the length of the header and footer and chop 'm off, right? Wrong. Unfortunately someone, somewhere decided that it would be more efficient in storage to vary the length of a part of the header based on it's contents. Oh, and someone came up with structured storage, so your original file is actually embedded in a mini file system, which needs reading out... but not always.

Here is how a file is stored inside Access:

Package header
Ole header
Data block length
Data (which can be a structured storage, but it can also be the actual file)
Sometimes a metafilepict block
Ole footer

So as you can see, there is a lot of variation in Ole files. Fortunately they do have some things in common to help with the process of extraction, but if you are looking to get a better understanding, I would advice to install a Hex-editor.

The first thing to look at is both of the headers. The package header looks like this:

A Signature (short): this indicates that the file is a package
The header size (short)
An object type (uint): 0 = linked, 1= embedded, 2 = either (who came up with either?)
The length of the friendly name in the header (short)
The length of the class name in the header (short)
The offset of the friendly name (short)
The offset of the class name (short)
The size of the object (int)
The friendly name (string, variable length)
The class name (string, variable length)

So, as you can see, you need to actually interpret the package header to get the right length, to skip over the header, but also you can learn what is inside the package by reading some of the information it provides.

The Ole header has a lot less information inside it:

The Ole version (uint)
The Format (uint)
The object type name length (int)
The object type name (string, variable length)

The Ole header actually ends with 8 empty bytes followed by 4 bytes that make up the length of the datablock as an int.

As I explained earlier, inside the datablock can be the actual file, but also there can be a structured storage. To determine if this is a structured storage, there is actually an 8 byte signature at the start of the storage.

Inside the storage, there is a mini file system, of which you can find the specs here. If you look in to them, you'll understand why I chose to use Eduardo's library, instead of building my own :-). Say, "thank you, Eduardo".

But... you should not always dig into the structured storage. If, for example you stored a Microsoft Word document in Access, you need to leave the structured storage in tact, as it's part of the Word document. So at this part, the class name from the package header comes in handy.

If you do have to dig into the structured storage, what you're looking for is the stream inside the CONTENTS element. It is the binary stream that makes up the original file.

Now, if you're dealing with images, like I am, there is another catch, called the metafilepict block. You're not likely to find a CONTENTS element if you need to read the metafilepict block. The big question for me was, where is the metafilepict block. It turns out that it is at a position you can calculate like this:

metefilepict start position = total package header length + total ole header length + 8 (empty bytes) + 4 (data length bytes) + data length + 45 (metafilepict header length)

The stream at this position contains your actual image file.

So now you know how it all works. In the next post we'll dive into some code I wrote to handle all this. If you have any questions, comments or suggestions, please leave us all a comment.

Developers 42

Wednesday, February 4, 2009

Ole and accessing files embedded in Access part #2

Wednesday, January 21, 2009

Ole and accessing files embedded in Access part #1

Subscribe via email

Search

Labels

My Blog List

About Me

Blog Archive

Privacy