David Cowen's Sunday Funday contests are great. In short, a question is posed about a subject, and readers get the opportunity to answer and explain how it is relevant in a forensic investigation. Much can be learned from these contests, as a great answer is almost always delivered and posted publicly. I highly recommend checking it out and submitting answers (it's a weekly thing, so the subjects vary widely).
This week's (2/16/14) subject was PDF Metadata. I took a stab at answering and wound up winning, so I figured it wouldn't hurt to post my answer here. The original answer, along with a preface by David Cowen, can be found here.
The Questions
The subject's prompt was as follows:
1. What metadata can be present within a PDF document?
2. What affects the types of metadata that will be present within a PDF document?
The Answer
The following is my answer to the prompt. Disclaimer: this was written in the wee-hours of one night, so if there are any typos or errors, please feel free to contact me.
----
Much can be gleaned from PDF metadata. Of course, there are the standard fields that will provide you with relatively common metadata...but depending on the program you use to create the PDF, there could be much, much more.
Let's start with the most common metadata. Most PDFs will have embedded metadata showing the PDF version, creation date, creation program, document ID, and last modified date. There are definitely more, but I have left them out as they wouldn't be very useful in an investigation (e.g. page count). The use for the aforementioned metadata is fairly obvious, but I will explain nonetheless.
The more obvious PDF metadata entries are:
- Creation program: program used to create the PDF (was it through desktop software, was it scanned, etc.)
- Creator/Author/Producer: Username or full name of the PDF's author OR further details on the program used to create the PDF (is it a previous employer?)
- Title: the title of the PDF that usually provides an outdated name for the document; good for identifying previous employer documents or documents that have been converted from one format into a PDF (e.g. SecretBluePrint.eps or oldCompanyFinances.doc shows up in the 'title' metadata entry)
Those are the easy ones. But what about the more
overlooked metadata? As I mentioned before, the program used to create
or modify the PDF may have a huge impact on what information you are
given. With that, let's look into it.
First, timestamps. We know that file metadata
could potentially serve as a better indicator of when a given document
was created. If the PDF has been transferred across various volumes and
systems -- and we would like to find the origin of the document -- the
creation date in the file metadata is going to be more reliable than the
file system creation date (as the filesystem date/time will have been updated with the
copies/moves).
The 'modified' date can be used in a similar way. We might even be able to tell how many programs through which the PDF was modified/saved. Say we have a PDF created using Adobe InDesign. If we were to open this PDF, modify it, and then save it as a new file using 'Save As...' in a program like Adobe Acrobat, we would see that the 'creation' date is still unchanged, but the 'modified' date had been updated (file system creation dates will tell us differently). Pretty standard stuff. Even if the PDF is saved using 'Save As...' (essentially creating a new file altogether with an updated file system creation time) AND it is moved from one system to the next, we will still have a genuine 'creation' date. Not only that, but we will have a metadata 'modified' date AND a new file system creation time to work with. Correlation among file metadata and file system timestamps is beyond the scope of this answer, but you get the point; 'creation' and 'modified' metadata dates are powerful and can be used creatively.
A More Reliable Creation Date
The metadata 'creation' date will [usually] preserve the REAL date of the file's creation. That is, if the PDF has been transferred across various volumes and systems, the 'creation' date in the file's metadata is going to give us a better idea of when the document was initially created.
The metadata 'creation' date will [usually] preserve the REAL date of the file's creation. That is, if the PDF has been transferred across various volumes and systems, the 'creation' date in the file's metadata is going to give us a better idea of when the document was initially created.
The 'modified' date can be used in a similar way. We might even be able to tell how many programs through which the PDF was modified/saved. Say we have a PDF created using Adobe InDesign. If we were to open this PDF, modify it, and then save it as a new file using 'Save As...' in a program like Adobe Acrobat, we would see that the 'creation' date is still unchanged, but the 'modified' date had been updated (file system creation dates will tell us differently). Pretty standard stuff. Even if the PDF is saved using 'Save As...' (essentially creating a new file altogether with an updated file system creation time) AND it is moved from one system to the next, we will still have a genuine 'creation' date. Not only that, but we will have a metadata 'modified' date AND a new file system creation time to work with. Correlation among file metadata and file system timestamps is beyond the scope of this answer, but you get the point; 'creation' and 'modified' metadata dates are powerful and can be used creatively.
Also, with many PDF timestamps, we will be able to see a timezone offset. For example, a creation timestamp could be 2013:02:22 11:21:34-06:00. We now have a potential indication that the program that produced this PDF was set in Mountain time.
I mentioned that we might be able to determine if a PDF was created and modified through more than one program. As a quick side note, and if we really wanted to dive into the PDF analysis, we could take a look at some of the other telling metadata. The example above suggested creating a PDF in InDesign, opening it up in Acrobat, modifying it, and then saving it as a new file. When this happens, some of the metadata in the new file (like the 'modified' time) is updated while all of the InDesign metadata stays intact. However, there is a significant difference this time around: the 'XMP Toolkit' metadata value is different. Adobe implements their XMP Toolkit in all of their applications and plugins. They even open sourced it, so other programs can use it (and many do). The point is, "the XMPCore component of this toolkit is what allows for the creation and modification of the metadata that follows the XMP Data Model" (more here and here). So we have two PDFs, but the metadata for each was manipulated by two different versions of Adobe's XMP Core.
InDesign used:
Acrobat used:
"Adobe XMP Core 5.4-c005 78.147326, 2012/08/23-13:03:03"
But
why is this important? Well, we can now more accurately pinpoint the
program used to create the PDF. Sure, we will likely already have a
metadata entry that tells us the 'Creation Program,' but consider the
above example; that tool (InDesign) may have been used to initially
create the PDF, but it was NOT used to open, modify, and save a new
version of it (Acrobat did that). Let's keep this in mind as we explain
some other interesting metadata...
Remember: the amount of metadata that a program uses when creating files is limitless. XMP is built on XML, so any metadata tags can be defined. Let's take a real-world example of how powerful PDF metadata can be when created from certain programs. Download Trustwave's Global Security Report PDF from 2013. Run it in exiftool. What do you see? That's right, the "History" metadata fields will show you not only that the document was saved 497 times, but it will also show you the exact times that is was saved, the program used to save it each time, and the Document Instance ID for each save (less exciting).
While you have that open, take a look at the creation date (2013:02:22 11:21:34-06:00) and modify date (2013:05:09 10:47:39-07:00). The modify date is much later, but the last "History" save on the file was 2013:02:22 11:18:06-06:00. What's up with that? This is because the PDF was modified in a different program; one newer than InDesign CS5.5. How do I know this? Well, look at the XMP Core version. The XMP Core version used for InDesign CS5.5 is "Adobe XMP Core 5.4-c005 78.147326, 2012/08/23-13:03:03." I just so happen to have a PDF created with InDesign CS6 and that PDF uses "Adobe XMP Core 5.3-c011 66.145661, 2012/02/06-14:56:27." How can it be that CS5.5 is using a later XMP Core version than CS6?! Because another program was used to modify the CS5.5 PDF after the last save. On 2013:05:09 10:47:39-07:00 (the modify date), some program (let's just say it's Acrobat to satisfy my example from before) modified the PDF. The XMP Core version shown in the metadata is NOT from CS5.5.
Also from the 'History' metadata, we can tell that the creation date is actually "2012:12:29 11:20:49-06:00." and NOT "2013:02:22 11:21:34-06:00." My guess is that InDesign was keeping track of the saves, but when it came down to exporting the PDF, it tacked on the export date as the "Create Date" (as the last 'History' save of the file is 3 minutes before the alleged "Create Date").
Remember: the amount of metadata that a program uses when creating files is limitless. XMP is built on XML, so any metadata tags can be defined. Let's take a real-world example of how powerful PDF metadata can be when created from certain programs. Download Trustwave's Global Security Report PDF from 2013. Run it in exiftool. What do you see? That's right, the "History" metadata fields will show you not only that the document was saved 497 times, but it will also show you the exact times that is was saved, the program used to save it each time, and the Document Instance ID for each save (less exciting).
While you have that open, take a look at the creation date (2013:02:22 11:21:34-06:00) and modify date (2013:05:09 10:47:39-07:00). The modify date is much later, but the last "History" save on the file was 2013:02:22 11:18:06-06:00. What's up with that? This is because the PDF was modified in a different program; one newer than InDesign CS5.5. How do I know this? Well, look at the XMP Core version. The XMP Core version used for InDesign CS5.5 is "Adobe XMP Core 5.4-c005 78.147326, 2012/08/23-13:03:03." I just so happen to have a PDF created with InDesign CS6 and that PDF uses "Adobe XMP Core 5.3-c011 66.145661, 2012/02/06-14:56:27." How can it be that CS5.5 is using a later XMP Core version than CS6?! Because another program was used to modify the CS5.5 PDF after the last save. On 2013:05:09 10:47:39-07:00 (the modify date), some program (let's just say it's Acrobat to satisfy my example from before) modified the PDF. The XMP Core version shown in the metadata is NOT from CS5.5.
Also from the 'History' metadata, we can tell that the creation date is actually "2012:12:29 11:20:49-06:00." and NOT "2013:02:22 11:21:34-06:00." My guess is that InDesign was keeping track of the saves, but when it came down to exporting the PDF, it tacked on the export date as the "Create Date" (as the last 'History' save of the file is 3 minutes before the alleged "Create Date").
If we really wanted to, we could use another
metadata field (the PDF version) to further pinpoint the program used.
If the PDF version is 1.7, we could look for programs on a suspect
computer that save PDFs to version 1.7 by default. Believe it or not,
many programs still save PDFs as version 1.4, 1.5, and 1.6.
----
-4n6k