The PDF specification (
ISO approved copy of the ISO 32000-1 Standards document) is authoritative and quite readable, so don't be intimidated by it. Here are a few particularly helpful sections to check when trying to write a PDF file:
Syntax section, in particular the
strings). The Document Structure subsection explains the dictionary entries in
Type 1 fonts sub-subsection of the
Simple Fonts subsection within the
Font dictionaries. TrueType font dictionaries have mostly the same entries.
Example PDF Files annex
The specification also discusses PDF's more advanced features, like color profiles, embedded 3D drawings, and active content.
On the topic of color profiles, avoid viewing Annex L, Colour Palettes; it takes forever to render, at least using libpoppler in Ubuntu. (Edit: recent versions of libpoppler are better, even if still a little slow compared to Adobe Reader.)
To decompress a PDF file with ps2pdf:

    ps2pdf -dCompressPages=false in.pdf out.pdf

Note that bitmap graphics in the output file will still be compressed, but all the text and vector graphics commands will be decompressed.
PDF's ancestor PostScript is a full stack-based programming language. Commands are run in PostScript by pushing arguments on the stack (for example
1 2) and then invoking operators (for example
add). Although PDF does not have all of the capabilities of PostScript, it too works by pushing arguments on the stack and invoking operators.
For example, given

    0 0 Td
    (Hello hello.) Tj

the PDF interpreter first pushes 0 and 0 on the stack. It then calls Td, which pops the top two items from the stack and uses them to set the text display position. In the second line, the interpreter pushes the string Hello hello. and calls Tj to pop the string and write it on the page.
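The push-then-invoke behaviour can be sketched in a few lines of Python. This is only an illustration of the stack model, not how a real PDF viewer works: the `run` function and its tokenizer are hypothetical, they understand only `Td` and `Tj`, and they ignore nested or escaped strings and the fact that `Td` offsets are relative to the start of the current line.

```python
import re

# Token pattern: a parenthesised string (no nesting or escapes handled),
# or any run of non-space characters.
TOKEN = re.compile(r"\([^)]*\)|\S+")


def run(content):
    """Interpret a tiny subset of PDF text operators (Td and Tj only)."""
    stack = []
    position = None
    shown = []  # (position, text) pairs that would be drawn on the page
    for token in TOKEN.findall(content):
        if token == "Td":
            # Operands were pushed in x, y order, so y comes off first.
            y = stack.pop()
            x = stack.pop()
            position = (x, y)
        elif token == "Tj":
            text = stack.pop()
            shown.append((position, text))
        elif token.startswith("("):
            stack.append(token[1:-1])   # string operand
        else:
            stack.append(float(token))  # numeric operand
    return shown


print(run("0 0 Td (Hello hello.) Tj"))
```

Operands are only ever pushed; it is the operator that decides how many of them to pop and what they mean.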
Take for example a minimal PDF file. It is specified as a version 1.1 PDF file because it is limited to features available in 1.1. Its second line is a comment containing 6 high-bit bytes (displayed as 3 characters when decoded as UTF-8), as required by the
File Header subsection of the specification (Section 7.5.2):
If a PDF file contains binary data, as most do (see 7.2, "Lexical Conventions"), the header line shall be immediately followed by a comment line containing at least four binary characters—that is, characters whose codes are 128 or greater. This ensures proper behaviour of file transfer applications that inspect data near the beginning of a file to determine whether to treat the file’s contents as text or as binary.
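As a rough check of that requirement, the following Python sketch builds a two-line header and counts the high-bit bytes in the comment line. The three characters used here are an assumption for illustration; any bytes with codes of 128 or greater would do. `binary_bytes_in_comment` is a hypothetical helper, and it assumes LF line endings.

```python
# Assumed header: the version line plus a comment line holding the UTF-8
# encodings of three non-ASCII characters (two bytes each).
header = b"%PDF-1.1\n%" + "\u00a5\u00b1\u00eb".encode("utf-8") + b"\n"


def binary_bytes_in_comment(data):
    """Count bytes with codes >= 128 on the second line of the file."""
    second = data.split(b"\n")[1]
    if not second.startswith(b"%"):
        return 0  # the second line is not a comment
    return sum(1 for byte in second if byte >= 128)


count = binary_bytes_in_comment(header)
print(count, count >= 4)
```

Each of the three characters encodes to two bytes in UTF-8, which is why a text editor shows 3 characters while the file holds 6 binary bytes.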
As seen in the minimal PDF file, only 4
indirect objects are required to display a single page of text: an indirect
Catalog dictionary, an indirect
Pages dictionary, an indirect
Page dictionary, and an indirect
stream. Indirect objects are accessed by reference. A reference looks like
2 0 R. 2 is the object number, 0 is the
generation number, and
R is the keyword that marks the reference. In contrast, direct objects appear directly inline. Any datum in the body that is not a reference to an indirect object is a direct object.
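To make the shape of a reference concrete, here is a small Python sketch that pulls references out of a dictionary with a regular expression. It is a simplification under stated assumptions: `find_references` is a hypothetical helper, and a pattern like this would also match look-alike text inside strings or streams, so a real parser needs a proper tokenizer.

```python
import re

# "N G R": object number, generation number, then the letter R.
REFERENCE = re.compile(rb"(\d+)\s+(\d+)\s+R\b")


def find_references(data):
    """Return (object number, generation number) pairs found in data."""
    return [(int(n), int(g)) for n, g in REFERENCE.findall(data)]


page = b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] /Contents 4 0 R >>"
print(find_references(page))
```

The numbers inside the MediaBox array are direct objects, so they are not reported; only the two `N G R` triples are.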
For the most part, direct objects and indirect objects are interchangeable, but not always. None of the 4 indirect objects in the minimal file, nor the direct
Length dictionary may be replaced. Converting a direct object to an indirect object allows easy reuse of that object anywhere in the PDF file.
Why can't the 4 indirect objects in the minimal PDF file be direct objects? For the
stream object, the reason is that the specification states that
all streams shall be indirect objects. For the others, the reason is that they are used as values in special dictionaries that contain a
Type key. Dictionaries with a defined
Type have rules about what other key–value pairs they may contain. One possible rule is that some of the values in the dictionary must be indirect objects. Each of the 3 indirect dictionary objects in the minimal file is at some point used as a value subject to this rule. For example the specification says that the value paired with the
Pages key in the Catalog dictionary
shall be an indirect reference:
|Key|Type|Value|
|Pages|dictionary|(Required; shall be an indirect reference) The page tree node that shall be the root of the document's page tree (see 7.7.3, "Page Tree").|
Note that PDF
objects are just data; they are not OOP
objects. For more details on objects and special dictionaries, see the specification subsections
Document Structure within the
    xref
    0 5
    0000000000 65535 f
    0000000018 00000 n
    0000000077 00000 n
    0000000178 00000 n
    0000000457 00000 n

The cross-reference table tells the PDF viewer where to find the indirect objects in the document. Counting from the line after the xref keyword, the first line gives the number of the first indirect object in the table, which in this case is the special object numbered 0 (the second example xref table below might help clarify this). It also gives the total number of objects in the table (five). The next line is for object 0, the head of the linked list of free objects (see the specification for more information). The third line describes object 1: the byte offset of the object relative to the beginning of the file, the generation number, the letter n, and a two-byte end-of-line (for example space + linefeed). Each successive line is implicitly counted as the next object, so the fourth line in this example is object 2. The letters indicate whether an object is free (f) or in-use (n). In a brand-new PDF file, for objects other than 0, the generation number will be 0 and the letter will be n. Using PDF's built-in mechanism to non-destructively overwrite information will increase the generation numbers of outdated objects and change their letters to f.
    xref
    0 2
    0000000000 65535 f
    0000000018 00000 n
    2 3
    0000000077 00000 n
    0000000178 00000 n
    0000000457 00000 n
This example shows that an xref table may contain multiple subsections, each started by a header line that consists of the starting object number and the total number of objects in the subsection. This particular table has two subsections: one starting at object 0 and one starting at object 2. The first subsection contains 2 objects, and the second contains 3.
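The subsection rule can be sketched as a small Python parser. This is a minimal illustration, not a robust reader: `parse_xref` is a hypothetical helper that works on already-decoded text split into lines, whereas a real reader would work on bytes and rely on each entry being exactly 20 bytes long.

```python
def parse_xref(text):
    """Map object number -> (byte offset, generation, in_use)."""
    lines = text.strip().splitlines()
    if lines[0].strip() != "xref":
        raise ValueError("expected the xref keyword")
    table = {}
    i = 1
    while i < len(lines):
        # Subsection header: first object number, then the entry count.
        first, count = (int(part) for part in lines[i].split())
        i += 1
        for number in range(first, first + count):
            offset, generation, kind = lines[i].split()
            table[number] = (int(offset), int(generation), kind == "n")
            i += 1
    return table


example = """xref
0 2
0000000000 65535 f
0000000018 00000 n
2 3
0000000077 00000 n
0000000178 00000 n
0000000457 00000 n
"""
table = parse_xref(example)
print(sorted(table))
print(table[2])
```

Object numbers are never written on the entry lines themselves; they come entirely from the subsection header plus the entry's position, which is why the header lines matter.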
Cross-Reference Table section under the
File Structure section of the
Updating Example from the
Example PDF Files annex
    trailer
    << /Root 1 0 R
       /Size 5
    >>
    startxref
    565
    %%EOF
The trailer dictionary must contain at least the Root and Size entries. The Root must be an indirect Catalog dictionary. This Catalog tells the PDF viewer where to find the various Pages objects that make up the document's displayable content. The Size must not be indirect. It is the number of entries in the cross-reference table, counting the special object 0, so in the minimal file it is 5: object 0 plus the 4 indirect objects. The startxref tells the PDF viewer the byte offset of the most recent xref table.
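A viewer starts reading from the end of the file, so a first step can be sketched as finding the number that follows the last startxref keyword. The Python below is an assumption-laden illustration: `find_startxref` is a hypothetical helper that scans the whole byte string with a regular expression, whereas a real reader would only inspect the tail of the file.

```python
import re


def find_startxref(data):
    """Return the byte offset recorded after the last startxref keyword."""
    matches = re.findall(rb"startxref\s+(\d+)\s+%%EOF", data)
    if not matches:
        raise ValueError("no startxref found")
    # The last match belongs to the most recent update of the file.
    return int(matches[-1])


tail = b"trailer\n<< /Root 1 0 R /Size 5 >>\nstartxref\n565\n%%EOF\n"
print(find_startxref(tail))
```

Taking the last match matters for files that have been updated in place, since each update appends a new xref table, trailer, and startxref.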
carriage return (not allowed as the EOL directly after the
CR + LF
Newline on Wikipedia and the
Character Set subsection of the spec.)
space + LF
space + CR
CR + LF
0-indexed byte counts are required for the
xref table, the
startxref, and the
Length of streams. Which bytes are included in the count and where do the counts start?
For indirect objects, the correct byte to use is the first byte of the
obj line, which is often the object number.
1 0 obj
Indirect object number 1 is located at byte offset 18 of the file, counting the first byte as 0.
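Rather than counting by hand, the offsets can be found with a short Python sketch. Everything here is an assumption for illustration: `object_offsets` is a hypothetical helper, the sample bytes are laid out so that object 1 lands at offset 18, and the pattern only recognises `obj` lines that start right after an LF (or at the start of the data), so CR-only files and look-alike text inside streams are not handled.

```python
import re


def object_offsets(data):
    """Map object number -> 0-indexed byte offset of its 'N G obj' line."""
    offsets = {}
    for match in re.finditer(rb"^(\d+)\s+(\d+)\s+obj\b", data, re.MULTILINE):
        # A later definition of the same object number replaces an earlier one.
        offsets[int(match.group(1))] = match.start()
    return offsets


# Assumed layout: 9-byte version line, 8-byte binary comment, one empty line.
sample = (
    b"%PDF-1.1\n"
    b"%\xc2\xa5\xc2\xb1\xc3\xab\n"
    b"\n"
    b"1 0 obj\n"
    b"<< /Type /Catalog /Pages 2 0 R >>\n"
    b"endobj\n"
)
print(object_offsets(sample))
```

The reference `2 0 R` inside the dictionary is not reported, because only lines of the form `N G obj` define an object.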
The number of bytes from the beginning of the line following the keyword
stream to the last byte just before the keyword
endstream. (There may be an additional EOL marker, preceding
endstream, that is not included in the count and is not logically part of the stream data.)
There should be an end-of-line marker after the data and before
endstream; this marker shall not be included in the stream length.
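Example (assuming the EOL is a single byte, such as LF):
|Stream line|Cumulative byte count|
|stream|0|
|BT|3|
|/F1 18 Tf|13|
|0 0 Td|20|
|(empty line)|21|
|(empty line)|22|
|(empty line)|23|
|(Hello hello.) Tj|41|
|ET|44|
|(empty line)|45|
|(empty line)|46|
|(empty line)|47|
|(empty line; this last EOL is not counted)|47|
|endstream|47|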
The correct count is 47. Notice in particular that the last end-of-line before
endstream is not counted.
To make things a little easier, one can calculate the length by subtracting the byte offset of the first character on the line after stream (BT in this example) from the byte offset of the last character before the end-of-line that precedes endstream, and then adding 1.
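The same calculation can be sketched in Python. The helper name `stream_length` is hypothetical, and the sample data is an assumption: it uses LF line endings and includes empty lines so that the total comes to the 47 bytes stated above. The function treats one EOL before endstream as not being part of the data, which is ambiguous if the data itself happens to end in CR.

```python
def stream_length(data):
    """Return the Length value for the first stream found in data."""
    start = data.index(b"stream") + len(b"stream")
    # Skip the EOL that ends the stream keyword line (CR LF or LF).
    if data[start:start + 2] == b"\r\n":
        start += 2
    elif data[start:start + 1] == b"\n":
        start += 1
    else:
        raise ValueError("stream keyword must be followed by CR LF or LF")
    end = data.index(b"endstream", start)
    # Drop the single EOL before endstream; it is not part of the data.
    if data[end - 2:end] == b"\r\n":
        end -= 2
    elif data[end - 1:end] in (b"\n", b"\r"):
        end -= 1
    return end - start


content = b"BT\n/F1 18 Tf\n0 0 Td\n\n\n\n(Hello hello.) Tj\nET\n\n\n\n"
sample = b"stream\n" + content + b"\nendstream"
print(stream_length(sample))
```

Because both offsets are taken from the same byte string, it does not matter whether the editor counts from 0 or 1; only the difference is used.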
xref section, estimate stream lengths, and use pdftk to fix the PDF file. Install

    pdftk foo.pdf output fixedfoo.pdf

It might not be able to handle every case.
vim -b, and do something like…

    :set rulerformat=%30(%4l,%4c,%4o%12P%)

…to put the current line, character, byte offset, and file position in the ruler. Vim calls the first byte 1, so subtract 1 from the counts to get correct numbers. For some reason, when I used Vim on certain pre-made PDF files it didn't include every newline in the count.
    hexdump -v -e '/1 "%_p"'

(thanks to this manual; also see a few more hexdump examples). This will print the whole file on one line with each byte represented by one printing character so character counts can be used as byte counts.
|10 Jul 2017||Correct
five to account for the 24 Dec 2012 edit.
|13 Sep 2015||Correct link to
Portable Document Format: An Introduction for Programmers
|03 May 2014||Mention
|08 Jan 2013||Comments link|
|24 Dec 2012||Update to match the corrected minimal PDF file: all streams must be indirect objects.|
|27 Jan 2012||small wording and formatting changes|
|26 Jan 2012||corrected the string drawing command to Tj|
|04 Sep 2011||proofreading, xref table subsections, updated ps2pdf and Ghostscript links|
|07 Apr 2011||xref table line numbers were misaligned|
|22 Jan 2011||using ps2pdf to decompress PDF files|
|02 Dec 2010||a little more about the trailer, some markup change around code snippets|
|20 Nov 2010||clean up|
|07 Oct 2010||Change revisions from a
|03 Oct 2010||Minor formatting change|
|13 Sep 2010||First-ish version|