Wednesday, February 4, 2009

Ole and accessing files embedded in Access part #2

This series seems to be plagued by delays. Unfortunately I haven't had much time lately. Work is really busy and I'm moving to a new home! So please be patient with me. I'll try to post as often as I can.

In the last post in this series I talked about some of the theoretical background of opening files that were stored by Access. We looked at the different headers and at metafilepict blocks. Today it's time to look at some code.

Before we dive into the code, I would like to point out that part of this code relies on the library written by Eduardo Morcillo. My code wouldn't run without it.

First order of business is to have structs that can hold my header information:

internal struct PackageHeader
{
    public short Signature;
    public short HeaderSize;
    public uint ObjectType;
    public short FriendlyNameLen;
    public short ClassNameLen;
    public short FrNameOffset;
    public short ClassNameOffset;
    public int ObjectSize;
    public string FriendlyName;
    public string ClassName;
}

internal struct OleHeader
{
    public uint OleVersion;
    public uint Format;
    public int ObjectTypeNameLen;
    public string ObjectTypeName;
}

As you can see we have two structs that contain information about the package header and the OLE header, based on the information we gathered in the last article. The types are based on the number of bytes each entry can store.
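To make the link between the field types and the header sizes concrete, here is a small sketch that adds up the widths of the fixed-size fields of both structs. The class HeaderSizes and its method names are made up for this illustration and are not part of the original class. The totals come out at 20 and 12 bytes, which is where the fixed header size constants used in this post come from.

```csharp
using System;

// Hypothetical helper, only meant to illustrate the arithmetic.
internal static class HeaderSizes
{
    // Sum of the fixed-width fields of PackageHeader, in declaration order.
    // The two string fields are variable length and are not counted.
    public static int FixedPackageHeaderBytes()
    {
        return sizeof(short)   // Signature
             + sizeof(short)   // HeaderSize
             + sizeof(uint)    // ObjectType
             + sizeof(short)   // FriendlyNameLen
             + sizeof(short)   // ClassNameLen
             + sizeof(short)   // FrNameOffset
             + sizeof(short)   // ClassNameOffset
             + sizeof(int);    // ObjectSize
    }

    // Sum of the fixed-width fields of OleHeader (ObjectTypeName excluded).
    public static int FixedOleHeaderBytes()
    {
        return sizeof(uint)    // OleVersion
             + sizeof(uint)    // Format
             + sizeof(int);    // ObjectTypeNameLen
    }
}
```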

The next order of business is to define some constant values we need during the process.

private const int FixedPackageHeaderSize = 20;
private const int FixedOleHeaderSize = 12;
private const int MetaFileHeaderSize = 45;
private const int BufferSize = 1024;
private const string ContentsEntryName = "CONTENTS";
private const string WorkBookEntryName = "Workbook";
private const string MSPhotoFriendlyName = "MSPhotoEd.3\0";

As you can see, some of these numbers are the fixed sizes of headers, the buffer size is arbitrary, and the string constants are needed to identify what type of data we are dealing with.
Finally we also need some private fields in our class that can hold some data for us:

private System.IO.Stream _input;
private long _endOfHeaderPosition;
private int _dataLength;
private PackageHeader _packageHeader;
private OleHeader _oleHeader;

The constructor of our class takes a parameter of type Stream, which is the input of the class. The constructor calls the method ReadHeader:

private void ReadHeader()
{
    if (_input.Position > 0 && _input.CanSeek)
    {
        _input.Seek(0, SeekOrigin.Begin);
    }

    byte[] fixedPackageHeaderData = new byte[FixedPackageHeaderSize];
    _input.Read(fixedPackageHeaderData, 0, FixedPackageHeaderSize);

    PackageHeader packageHeader = new PackageHeader();
    packageHeader.Signature = CalcShortFromBytes(new byte[] { fixedPackageHeaderData[0], fixedPackageHeaderData[1] });
    packageHeader.HeaderSize = CalcShortFromBytes(new byte[] { fixedPackageHeaderData[2], fixedPackageHeaderData[3] });
    packageHeader.ObjectType = CalcUIntFromBytes(new byte[] { fixedPackageHeaderData[4], fixedPackageHeaderData[5], fixedPackageHeaderData[6], fixedPackageHeaderData[7] });
    packageHeader.FriendlyNameLen = CalcShortFromBytes(new byte[] { fixedPackageHeaderData[8], fixedPackageHeaderData[9] });
    packageHeader.ClassNameLen = CalcShortFromBytes(new byte[] { fixedPackageHeaderData[10], fixedPackageHeaderData[11] });
    packageHeader.FrNameOffset = CalcShortFromBytes(new byte[] { fixedPackageHeaderData[12], fixedPackageHeaderData[13] });
    packageHeader.ClassNameOffset = CalcShortFromBytes(new byte[] { fixedPackageHeaderData[14], fixedPackageHeaderData[15] });
    packageHeader.ObjectSize = CalcIntFromBytes(new byte[] { fixedPackageHeaderData[16], fixedPackageHeaderData[17], fixedPackageHeaderData[18], fixedPackageHeaderData[19] });

    byte[] friendlyNameData = new byte[packageHeader.FriendlyNameLen];
    _input.Read(friendlyNameData, 0, packageHeader.FriendlyNameLen);
    packageHeader.FriendlyName = Encoding.UTF8.GetString(friendlyNameData);

    byte[] classNameData = new byte[packageHeader.ClassNameLen];
    _input.Read(classNameData, 0, packageHeader.ClassNameLen);
    packageHeader.ClassName = Encoding.UTF8.GetString(classNameData);

    _packageHeader = packageHeader;

    byte[] fixedOleHeaderData = new byte[FixedOleHeaderSize];
    _input.Read(fixedOleHeaderData, 0, FixedOleHeaderSize);

    OleHeader oleHeader = new OleHeader();
    oleHeader.OleVersion = CalcUIntFromBytes(new byte[] { fixedOleHeaderData[0], fixedOleHeaderData[1], fixedOleHeaderData[2], fixedOleHeaderData[3] });
    oleHeader.Format = CalcUIntFromBytes(new byte[] { fixedOleHeaderData[4], fixedOleHeaderData[5], fixedOleHeaderData[6], fixedOleHeaderData[7] });
    oleHeader.ObjectTypeNameLen = CalcIntFromBytes(new byte[] { fixedOleHeaderData[8], fixedOleHeaderData[9], fixedOleHeaderData[10], fixedOleHeaderData[11] });

    byte[] objectTypeNameData = new byte[oleHeader.ObjectTypeNameLen];
    _input.Read(objectTypeNameData, 0, oleHeader.ObjectTypeNameLen);
    oleHeader.ObjectTypeName = Encoding.UTF8.GetString(objectTypeNameData);

    _oleHeader = oleHeader;

    for (int index = 0; index < 8; index++)
    {
        _input.ReadByte();
    }

    byte[] lengthData = new byte[4];
    _input.Read(lengthData, 0, 4);
    _dataLength = BitConverter.ToInt32(lengthData, 0);

    _endOfHeaderPosition = _input.Position;
}

This method reads the header and decomposes it into entries. This allows us to get to the variable bits and read them correctly as well. It also gives us two important pieces of information: the length of the data block and the end position of the header.
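The CalcShortFromBytes, CalcUIntFromBytes and CalcIntFromBytes helpers are not listed in this post. As a rough sketch only, and assuming the header values are stored little-endian (least significant byte first), they could look something like the following. The class name HeaderBytes is made up for this example, and the real helpers in the downloadable class may well be written differently.

```csharp
using System;

// Hypothetical stand-ins for the helpers used by ReadHeader.
// Assumption: all header values are stored little-endian.
internal static class HeaderBytes
{
    // Combines 2 bytes into a signed 16-bit value.
    public static short CalcShortFromBytes(byte[] data)
    {
        return unchecked((short)(data[0] | (data[1] << 8)));
    }

    // Combines 4 bytes into an unsigned 32-bit value.
    public static uint CalcUIntFromBytes(byte[] data)
    {
        return (uint)data[0]
             | ((uint)data[1] << 8)
             | ((uint)data[2] << 16)
             | ((uint)data[3] << 24);
    }

    // Combines 4 bytes into a signed 32-bit value.
    public static int CalcIntFromBytes(byte[] data)
    {
        return unchecked((int)CalcUIntFromBytes(data));
    }
}
```

With these definitions, the two bytes 15 1C (hex) at the start of a package object would be read as the signature 0x1C15. Doing the shifts by hand, rather than calling BitConverter, keeps the result independent of the byte order of the machine the code runs on.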

So on to the business end of this class, its GetStrippedStream method:

public System.IO.Stream GetStrippedStream()
{
    if (_input.Position != _endOfHeaderPosition && _input.CanSeek)
    {
        _input.Seek(_endOfHeaderPosition, SeekOrigin.Begin);
    }
    if (_packageHeader.ClassName.Equals(MSPhotoFriendlyName, StringComparison.OrdinalIgnoreCase))
    {
        // MSPhotoEd objects: skip _dataLength bytes plus the metafile header.
        _input.Seek(_dataLength + MetaFileHeaderSize, SeekOrigin.Current);
    }

    // Copy the remainder of the input to a temp file.
    string tempFileName = Path.GetTempFileName();
    FileStream tempFileStream = File.OpenWrite(tempFileName);

    byte[] buffer = new byte[BufferSize];
    int loadedBytes = _input.Read(buffer, 0, BufferSize);
    while (loadedBytes > 0)
    {
        tempFileStream.Write(buffer, 0, loadedBytes);
        loadedBytes = _input.Read(buffer, 0, BufferSize);
    }
    tempFileStream.Close();

    System.IO.Stream outputStream;
    bool isCompoundFile = Storage.IsCompoundStorageFile(tempFileName);
    if (isCompoundFile)
    {
        Storage storage = new Storage(tempFileName);
        Storage.StorageElementsCollection elements = storage.Elements();
        // Look for a stream element named CONTENTS or Workbook.
        var result = from StatStg element in elements
                     where (element.Name.Equals(ContentsEntryName, StringComparison.OrdinalIgnoreCase)
                            || element.Name.Equals(WorkBookEntryName, StringComparison.OrdinalIgnoreCase))
                           && element.Type == StatStg.ElementType.Stream
                     select element;
        if (result.Any())
        {
            outputStream = storage.OpenStream(result.First().Name);
        }
        else
        {
            // No matching element: return the complete file.
            storage.Close();
            outputStream = File.OpenRead(tempFileName);
        }
    }
    else
    {
        // Not a structured storage: return the complete file.
        outputStream = File.OpenRead(tempFileName);
    }
    return outputStream;
}

As you can see, we first set the stream to the right position, based on the header information. We then write the stream to a temp file. The temp file is used to work with Eduardo's library. We use it to determine if this file is actually a structured storage and if so we extract only the stream we need. If this is a Microsoft Word document, then it will not have the elements we look for in the structured storage. In this case we want to send the complete file as a result. If this was not a structured storage, then we want to send the complete file as well.
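The copy-to-temp-file step can also be pulled out into a small helper of its own. The sketch below is a variation on the loop in GetStrippedStream, not the code from the class itself: the names TempFileCopy and CopyToTempFile are made up for this example, and it wraps the file stream in a using block so the temp file is closed even when reading the input throws.

```csharp
using System;
using System.IO;

// Hypothetical helper; mirrors the copy loop in GetStrippedStream.
internal static class TempFileCopy
{
    private const int BufferSize = 1024;

    // Copies everything from the current position of 'input' to a new
    // temp file and returns the path of that file.
    public static string CopyToTempFile(Stream input)
    {
        string tempFileName = Path.GetTempFileName();
        using (FileStream tempFileStream = File.OpenWrite(tempFileName))
        {
            byte[] buffer = new byte[BufferSize];
            int loadedBytes;
            // Read returns 0 once the end of the stream is reached.
            while ((loadedBytes = input.Read(buffer, 0, BufferSize)) > 0)
            {
                tempFileStream.Write(buffer, 0, loadedBytes);
            }
        }
        return tempFileName;
    }
}
```

The caller is responsible for deleting the temp file once it is done with it.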

Warning:
The code may suggest this works for Microsoft Excel as well, but unfortunately it doesn't. The reason for this is that whenever Access embeds an Excel file, it changes the structured storage completely, to the point that the original file cannot be recovered. I have managed to get the data out of it, but I've not managed to get a properly working Excel file. If anyone can provide me with some more insight into this I would be very grateful.

Below you can find the .cs file with the complete class.


This concludes the series on Ole. I hope you have found it helpful. Please leave any comments and/or questions below. I'm always happy to read and reply.

Update April 5th 2012: Reembedded the download as it was broken.

6 comments:

  1. many thx for your solution I was looking since last week. but two things I'm still looking for. how can I retrieve the filename and the fileextension of the saved OLE-object?

  2. Hi domasch,

    You're welcome. Unfortunately for you this is complicated. If you don't have a Structured Storage then the OLE object doesn't contain filename information.

    Maybe the file data itself contains this information, but you would have to handle each file on its own and this would be well beyond the scope of this article.

    If you do have a Structured Storage, you might find something in the StatStg class, but I'm not sure a filename is available there either.

    For our implementation this is no issue as we store the original filename in a field next to the file.

    Hope this helped you.

    Greets,
    Jonathan

  3. Hi Jonathan

    I have tried to use your code to extract jpeg data from an access database, with no success yet. Everything runs but the browser just renders the red X! The original ole objects seem to be of type MS_ClipArt_Gallery.5. Whereas your code seems to cater for MSPhotoEd.3 objects. Any clues as to how to customise your code to work for MS_ClipArt_Gallery.5 objects?

    Many thanks
    Gareth

  4. Hi Gareth,

    I would expect that the ClipArt type you're using has a different header length from the PhotoEd type. You'd have to figure out the length needed. After that all that remains is to add the type to the GetStrippedStream method, in the same way the PhotoEd type is handled there.

    So where it now checks for the PhotoEd type, there should be another if to check for the ClipArt type. If it is the ClipArt type, do a seek from the current point to the _dataLength + the header size for a ClipArt file.

    I hope that helps you. If you figure it out, or if you need more help, please let us know.

    Greets,
    Jonathan

  5. Great article, Jonathan, thanks for posting. I'm working on implementing a process that extracts embedded objects from Excel, Word & PowerPoint. Do you know if this code should also work on those as well? The program that I created is already working very well for the most part, with a few loose ends yet to tie up. I can extract most files just fine, but the 'MSPhotoEd' file type is giving me some headaches. I believe it is because I am not finding the correct point where the header ends and the file content begins, so I'm looking to the header parsing routine to get the info that I am missing. So far my attempts have not been successful as the headers that I have processed do not seem to follow the same pattern. If given the streams '\0Ole', '\0CompObj', 'CONTENTS' & 'CONTENTSV30', can you clarify which stream is represented by '_input' in your ReadHeader routine? Or if by chance I should be using some other stream or binary content? Any help is appreciated. Thanks.

  6. Hi there,

    The short answer: You need the complete package object. It starts with hex values 15 1C. What the code does is get you the actual content stream.

    To get a better understanding of this, have a look at part one of this article: here.

    It describes which headers are actually handled by the OleStripper class.
    If you have questions after that, please let me know. I'm always happy to help.

    Greets,
    Jonathan
