How to Extract Embedded Objects from Microsoft Office Documents with Java

In this post I walk through Java code examples that use Apache POI to extract embedded objects from Word, Excel and PowerPoint documents, in both the OLE 2 and OOXML formats. Some cases need format-specific workarounds, while others are covered by a consistent API.

  1. OLE 2: POI treats embedded objects as documents stored in subdirectories of the OLE 2 (POIFS) filesystem. The exact location of the embedded documents varies with the type of the master document. Word and Excel have a similar structure but different directory naming patterns, and we need to iterate over all the directories in the filesystem. Embedded objects in PowerPoint are stored differently, and POI provides an API to access them.

  2. OOXML: POI provides a consistent API, getAllEmbedds(), that works for all three: Word, Excel and PowerPoint.
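Since the two formats need different handling, it helps to know which one a given file or embedded stream is. The following is a minimal sketch that only looks at the leading bytes. The class OfficeFormatSniffer and its methods are hypothetical names of my own, not POI classes, and the sketch assumes that any ZIP archive is an OOXML package, which a real check would verify further. POI ships its own detection helpers, so this is only meant to show the idea.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

/** Hypothetical helper, not part of POI: guesses the container format from the leading bytes. */
final class OfficeFormatSniffer {

    enum Format { OLE2, OOXML, UNKNOWN }

    // Signature at the start of every OLE 2 compound file.
    private static final int[] OLE2_MAGIC = {0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1};
    // ZIP local file header "PK\3\4". OOXML packages are ZIP archives, so this
    // matches any ZIP file; treating it as OOXML is an assumption of this sketch.
    private static final int[] ZIP_MAGIC = {0x50, 0x4B, 0x03, 0x04};

    /** Classifies a buffer that holds the first bytes of a file or stream. */
    static Format sniff(byte[] header) {
        if (startsWith(header, OLE2_MAGIC)) return Format.OLE2;
        if (startsWith(header, ZIP_MAGIC)) return Format.OOXML;
        return Format.UNKNOWN;
    }

    /** Reads up to 8 bytes from a mark-supporting stream, then rewinds it for the real parser. */
    static Format sniff(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(8);
        byte[] header = new byte[8];
        int read = 0;
        while (read < header.length) {
            int n = in.read(header, read, header.length - read);
            if (n < 0) break; // end of stream before 8 bytes were available
            read += n;
        }
        in.reset();
        return sniff(Arrays.copyOf(header, read));
    }

    private static boolean startsWith(byte[] data, int[] magic) {
        if (data == null || data.length < magic.length) return false;
        for (int i = 0; i < magic.length; i++) {
            if ((data[i] & 0xFF) != magic[i]) return false;
        }
        return true;
    }
}
```

The stream variant expects a stream that supports mark/reset, for example a BufferedInputStream, so that the same stream can still be handed to POI afterwards.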

The next three sections cover OLE 2 based documents and the last section covers the OOXML format. Now let’s look into the details.

Files embedded in Word

Word normally stores embedded files in subdirectories of the ObjectPool directory, itself a subdirectory of the filesystem root. Typically these subdirectories are named starting with an underscore, followed by 10 digits.

The following is a sample structure from one of my test cases:

Root Entry
    SummaryInformation
    DocumentSummaryInformation
    WordDocument
    1Table
    ObjectPool
        _1541498334
            Pictures
            SummaryInformation
            PowerPoint Document
            DocumentSummaryInformation
            Current User
            CompObj
            ObjInfo
            Ole
        _1541498335
            EPRINT
            ObjInfo
            SummaryInformation
            DocumentSummaryInformation
            Workbook
            CompObj
            Ole
        _1541497951
            CompObj
            WordDocument
            SummaryInformation
            DocumentSummaryInformation
            ObjInfo
            1Table
    CompObj
    Data

Files embedded in Excel

Excel normally stores embedded files in subdirectories of the filesystem root. Typically these subdirectories are named starting with MBD, with 8 hex characters following.

The following is a sample structure:

Root Entry
    MBD00006170
        CompObj
        Current User
        PowerPoint Document
        DocumentSummaryInformation
        SummaryInformation
        Pictures
        Ole
    SummaryInformation
    DocumentSummaryInformation
    Workbook
    CompObj

For both Word and Excel master documents the structure is similar: we need to iterate through the filesystem to reach the embedded objects. The following sample code provides that access:

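    import org.apache.poi.poifs.filesystem.DirectoryEntry;
    import org.apache.poi.poifs.filesystem.Entry;

    // TODO match the entry name against a pattern to identify embedded objects
    protected static boolean isEmbeddedPart(Entry entry) {
        return false; // placeholder until a name pattern is supplied
    }

    // Walks the POIFS tree depth-first, starting at the given entry.
    public static void process(Entry entry) {
        if (entry == null) {
            return;
        }

        // log and indent(...) are helpers defined elsewhere in the surrounding class.
        if (log.isDebugEnabled()) {
            log.debug(indent(entry.getName()));
        }

        if (isEmbeddedPart(entry)) {
            // TODO Extract Content
        }

        // A DirectoryEntry is iterable over its child entries.
        if (entry.isDirectoryEntry()) {
            for (Entry child : (DirectoryEntry) entry) {
                process(child);
            }
        }
    }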

Based on the above Java code, for a Word master document we just need to match the entry name against a pattern that starts with an underscore followed by 10 digits. For an Excel master document we only need to change the matching pattern.
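The name matching can be sketched with two regular expressions built from the naming conventions described earlier. The class EmbeddedEntryNames and its methods are hypothetical names for illustration. The sketch works on plain strings, so in the tree walk the arguments would come from entry.getName() and entry.getParent().getName(). Checking the parent name for Word is my own extra safeguard, based on the ObjectPool layout shown in the sample structure.

```java
import java.util.regex.Pattern;

/** Hypothetical helper: decides from names alone whether a POIFS directory looks like an embedded object. */
final class EmbeddedEntryNames {

    // Word: an underscore followed by 10 digits, e.g. "_1541498334".
    private static final Pattern WORD = Pattern.compile("_\\d{10}");
    // Excel: "MBD" followed by 8 hex characters, e.g. "MBD00006170".
    private static final Pattern EXCEL = Pattern.compile("MBD[0-9A-Fa-f]{8}");

    /** Word keeps embedded objects in subdirectories of "ObjectPool". */
    static boolean isWordEmbedded(String parentName, String name) {
        return "ObjectPool".equals(parentName) && name != null && WORD.matcher(name).matches();
    }

    /** Excel keeps embedded objects in "MBD..." subdirectories of the root. */
    static boolean isExcelEmbedded(String name) {
        return name != null && EXCEL.matcher(name).matches();
    }
}
```

Both patterns use matches() rather than find(), so a longer name that merely contains such a sequence is not picked up by accident.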

Once we decide that an entry is an embedded object, we can pass it directly to the matching constructor of either WordExtractor or ExcelExtractor. Both accept a DirectoryNode, which implements DirectoryEntry.

Files embedded in PowerPoint

PowerPoint does not normally store embedded files in the OLE2 layer. Instead, they are held within records of the main PowerPoint file.

The following is a sample structure for PowerPoint. We can see that there is no embedded object directory like the ones in Word or Excel.

    Root Entry
        SummaryInformation
        PowerPoint Document
        DocumentSummaryInformation
        Pictures
        Current User

However, PowerPointExtractor provides a public method to access its embedded objects:

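    import org.apache.poi.hslf.extractor.PowerPointExtractor;
    import org.apache.poi.hslf.model.OLEShape; // package as of POI 3.x

    PowerPointExtractor ppt; // create via one of its constructors

    for (OLEShape obj : ppt.getOLEShapes()) {
        // The full name identifies the application that created the embedded object.
        String fullName = obj.getFullName();
        if (fullName.startsWith("Microsoft Excel")) {
            // TODO handle an embedded Excel workbook
        } else if (fullName.startsWith("Microsoft PowerPoint")) {
            // TODO handle an embedded PowerPoint presentation
        } else if (fullName.startsWith("Microsoft Word")) {
            // TODO handle an embedded Word document
        } else {
            // TODO handle any other embedded object type
        }
    }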

We can also get the embedded data as an InputStream from the OLEShape: obj.getObjectData().getData(). The Word, Excel and PowerPoint extractor constructors all accept an InputStream.

OOXML formats

For OOXML based documents, we need to use different classes provided by POI:

  1. XWPFDocument for Word
  2. XSSFWorkbook for Excel
  3. XMLSlideShow for PowerPoint

OOXML is more consistent and easier to deal with, as all three of the classes above provide the same API call: getAllEmbedds(), which returns all of the embedded objects as a list of org.apache.poi.openxml4j.opc.PackagePart. We can easily get the InputStream from each PackagePart and create the corresponding extractor again.
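To pick the corresponding extractor we first need to know what kind of document each part holds. One option is to look at the part's content type, which PackagePart exposes through getContentType(). The following is a rough sketch of such a lookup. The class EmbeddedPartKinds and the Kind enum are hypothetical names of my own, and the table only covers the common Office content types, so anything else falls through to OTHER.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper: maps a part's content type to the kind of embedded document. */
final class EmbeddedPartKinds {

    enum Kind {
        WORD_OOXML, EXCEL_OOXML, POWERPOINT_OOXML,
        WORD_OLE2, EXCEL_OLE2, POWERPOINT_OLE2,
        GENERIC_OLE, OTHER
    }

    private static final Map<String, Kind> BY_CONTENT_TYPE = new HashMap<>();

    // Keys are stored in lower case so that lookups are case-insensitive.
    private static void register(String contentType, Kind kind) {
        BY_CONTENT_TYPE.put(contentType.toLowerCase(Locale.ROOT), kind);
    }

    static {
        register("application/vnd.openxmlformats-officedocument.wordprocessingml.document", Kind.WORD_OOXML);
        register("application/vnd.openxmlformats-officedocument.spreadsheetml.sheet", Kind.EXCEL_OOXML);
        register("application/vnd.openxmlformats-officedocument.presentationml.presentation", Kind.POWERPOINT_OOXML);
        register("application/msword", Kind.WORD_OLE2);
        register("application/vnd.ms-excel", Kind.EXCEL_OLE2);
        register("application/vnd.ms-powerpoint", Kind.POWERPOINT_OLE2);
        // Generic OLE wrapper; the real type has to be found inside the OLE 2 stream.
        register("application/vnd.openxmlformats-officedocument.oleObject", Kind.GENERIC_OLE);
    }

    /** Returns OTHER for null or unrecognised content types. */
    static Kind classify(String contentType) {
        if (contentType == null) {
            return Kind.OTHER;
        }
        return BY_CONTENT_TYPE.getOrDefault(contentType.trim().toLowerCase(Locale.ROOT), Kind.OTHER);
    }
}
```

In the loop over the returned parts this would be called as EmbeddedPartKinds.classify(part.getContentType()), followed by part.getInputStream() for the extractor. Note that in more recent POI releases the method appears under the name getAllEmbeddedParts(), so the call may need adjusting depending on the POI version in use.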

Summary

I discussed Java code examples that use Apache POI to extract embedded objects from Word, Excel and PowerPoint documents in both the OLE 2 and OOXML formats. The OLE 2 formats need a filesystem walk for Word and Excel and a dedicated API for PowerPoint, while the OOXML formats share one consistent API.

Reference:

http://poi.apache.org/poifs/embeded.html

Written on December 5, 2016