simple-pdf

The PDF format is essentially a plain-text file, so a text editor is all you need to create one. Below is a minimal PDF file that displays only "Hello World".

%PDF-1.2

1 0 obj

<<

 /Type /Page

 /Parent 5 0 R

 /Resources 3 0 R

 /Contents 2 0 R

>>

endobj

2 0 obj

<<

  /Length 51

>>

 stream

BT

  /F1 24 Tf

  1 0 0 1 260 254 Tm

  (Hello World)Tj

ET

 endstream

endobj

3 0 obj

<<

 /ProcSet[/PDF/Text]

 /Font <</F1 4 0 R >>

>>

endobj

4 0 obj

<<

 /Type /Font

 /Subtype /Type1

 /Name /F1

 /BaseFont/Helvetica

>>

endobj

5 0 obj

<<

 /Type /Pages

 /Kids [ 1 0 R ]

 /Count 1

 /MediaBox

   [ 0 0 612 446 ]

>>

endobj

6 0 obj

<<

 /Type /Catalog

 /Pages 5 0 R

>>

endobj

trailer

<<

 /Root 6 0 R

>>

p.1

Introduction to the Insides of PDF

Welcome to our look inside PDF. We will literally be looking inside a set of example PDF files. We will be going over a lot of details but the real objective is to let you carry away a general understanding of what kinds of things are inside of PDF files and how they are organized and relate to one another. So don't let the details discourage you.

The actual example files are a companion to this slide presentation.

You can view the example files with Adobe Reader or Acrobat Standard or Professional. You can also look inside the files yourself by using any text editor like Notepad, Microsoft Word, Emacs, etc. If you really get into it you can make small changes to the files and then display them with Acrobat to see the effect.

-- 4/26/2005

p.2

References

* PDF Language Reference Manual (1.5 & 1.4)

http://partners.adobe.com/asn/tech/pdf

(under Technologies/PDF Reference)

* PDF/A Draft Specifications

http://www.aiim.org/standards

(Select PDF-Archive, ISO PDF-Archive, CD)

* PDF Tutorial (Inside PDF)

http://home.comcast.net/~jk05/presentations

In June 1993 Adobe announced Acrobat, and the first documentation on PDF, the Portable Document Format Language Reference Manual, was available from Addison Wesley Publishing Company. Those books are still for sale, and a new one, the 5th Edition, just came out.

That 5th Edition documents PDF version 1.6 and it can be found in electronic format on Adobe's World Wide Web site as noted on the slide.

Information about the archiving subset of PDF, PDF/A can be found on the AIIM site as shown.

This presentation, together with the example PDF files, will be on my private Web site (Comcast) as shown. You will find several other tutorial presentations of mine at that site as well.

p.3

PDFs are Composite Documents

Collecting many instances of many things

* Page contents

* Images

* Graphics

* Fonts

* Colorspaces

* Metadata

* Annotations

* Links

* Digital signatures

* <and more>

More and more I am beginning to look at PDF files as composite documents. PDFs are self-contained, complete documents that are made from images, graphics and text blocks, and that have document features like tables of contents, digital signatures, etc.

p.4

Composite Documents

These are just two examples to display what I mean by composite documents.

The one on the left has several images down the left hand side. It has a company graphic at the top right and it has nicely laid out text.

The right one has a complicated graphic, a logo and a form to fill out on the bottom.

Composite documents are made from distinct document pieces of different types.

p.5

The Basics

Inside PDF

Now we step back and go through the basics of exactly how a PDF file is put together. It is really very simple, yet very powerful.

p.6

Looking Inside

* PDF files are made from “objects”

* Objects are numbered

* Objects can occur in any order in a file

* Objects can refer to each other by number

* References can create a cross-linked set of objects

(mathematical graph)

* Cross reference table maps object numbers to places within the file

This slide is pretty self describing if you read it from top to bottom. We will see exactly what objects look like in the next 50 slides or so.

p.7

This is a sample page that is a take off on the sample programs people write to output "Hello World". Here we have a page (8.5 x 6.19 inches) laid out to match the format of the rest of the slides.

We might more normally in the US have a page that is 8.5 x 11 inches. We will see how the page size is determined.

The words "Hello World" are presented roughly in the middle of the page in 24 point Helvetica.

p.8

We will see a lot of pages like this one so let me tell you carefully what you are looking at. I opened the example 01 PDF file in Microsoft Word as a "text" file. The characters from the PDF file have been formatted into three columns and I have also inserted some line breaks and indentations in order to make the text more readable. I also added headings and footings including page numbers.

There are sections highlighted in colors so that your attention is drawn to the parts we will be discussing. If you do a text copy from this page and put it into a simple text page, then save it as a file, that file will be a PDF document that will display Hello World. If you really try this with Word, make sure that you do a "saveas" and select "text only with line breaks" as the format. Do not expect this to work if you save it as a Word document.

Since the page displays "Hello World" you might expect the PDF file to have that character string somewhere within it. You would be right and I have highlighted this in red. We will work our way outward from this string and see what supporting material is required to turn that text string into a complete PDF document. Notice that "Hello World" is enclosed in parentheses. This is how strings are represented in both PDF and PostScript.

p.9

Note that the material in the file is organized into 6 objects, a "%PDF-1.2" header, and a trailer.

Each of the objects has a number followed by a zero, begins with "obj" and ends with "endobj".
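The "number, zero, obj ... endobj" pattern described above is regular enough to pick out with a few lines of Python. This is only a rough sketch: the name `list_objects` and the shortened `SAMPLE` text are made up for illustration, and a regular expression like this would be fooled by binary stream data in a real file.

```python
import re

# Shortened, hand-typed stand-in for example 01 (an assumption, not the real file).
SAMPLE = """%PDF-1.2
1 0 obj
<< /Type /Page /Parent 5 0 R >>
endobj
6 0 obj
<< /Type /Catalog /Pages 5 0 R >>
endobj
trailer
<< /Root 6 0 R >>
"""

# object number, second number, "obj", body, "endobj"
OBJ_RE = re.compile(r"(\d+)\s+(\d+)\s+obj\b(.*?)\bendobj", re.DOTALL)


def list_objects(pdf_text):
    """Return (object number, second number, body text) for each object found."""
    return [(int(num), int(gen), body.strip())
            for num, gen, body in OBJ_RE.findall(pdf_text)]


for num, gen, body in list_objects(SAMPLE):
    print(num, gen, body)
```

Note that "5 0 R" inside an object is not mistaken for an object definition, because it is not followed by "obj".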

I have often referred to PDF as "object oriented PostScript" because this object structuring is something that PDF has but PostScript does not.

Note also that the file starts with "%PDF-1.2" which indicates that this is a PDF file following the 1.2 version of the PDF specification. Since the current products support PDF 1.6 I could have put that on the first line. But there is nothing in this file that was not supported by the very early products.

p.10

Objects are numbered, and the second numbers, shown here as zeroes, are the objects' version numbers (the PDF Reference calls them generation numbers). This is because it is possible to update PDF files in place without copying the whole file, and sometimes it is necessary to indicate that this is a new version of a particular object. So object "1 2" would be a newer version of object 1 than would be "1 0".

I have highlighted two ways in which the object numbers are used. The simple arrowheads indicate the object definitions whereas the full arrows show where references to the object are found within other objects. The "R" is a notation that indicates that the two preceding numbers form a reference to an object.

The PDF document actually starts at the end of the file at the "trailer" where we see that the "/Root" of the document is object 6. If we look at object 6 we see that it is the document's "/Catalog" and it refers to the document's "/Pages" as object 5. Object 5 then has a reference to a single "/Kids" page at object 1 which is of "/Type" "/Page".

p.11

Objects Form a Graph

/Root → (6 0 R) /Catalog →(5 0 R) /Pages → (1 0 R) /Page → (2 0 R) /Contents

(1 0 R) /Page → (3 0 R) /Resources → (4 0 R) /Font

Here is a pictorial diagram of the initial structure required of all PDF files. The Root always points to the Catalog which points to the Pages node. The Pages node has one or more Kids nodes which represent the individual pages in the document. Note that the Page nodes each point back to the Pages node. Each Page node will subsequently point further to other objects that make up each page. We will be looking at those objects a little later on. When we use the term "points to" or "refers to" we mean the linking established by the object numbers and references to them from within other objects.
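To make the "graph" idea concrete, here is a small Python sketch that models the six objects of example 01 as a dictionary and follows the references outward from the trailer's /Root. The tuple form `("R", 5, 0)` for "5 0 R" and the helper names `ref`, `references` and `reachable` are my own inventions for illustration, not anything defined by PDF.

```python
def ref(num, gen=0):
    """Model the reference 'num gen R' as a tuple."""
    return ("R", num, gen)


# Hand-transcribed from example 01: object number -> contents.
OBJECTS = {
    1: {"/Type": "/Page", "/Parent": ref(5),
        "/Resources": ref(3), "/Contents": ref(2)},
    2: {"/Length": 51},
    3: {"/Font": {"/F1": ref(4)}},
    4: {"/Type": "/Font", "/BaseFont": "/Helvetica"},
    5: {"/Type": "/Pages", "/Kids": [ref(1)], "/Count": 1},
    6: {"/Type": "/Catalog", "/Pages": ref(5)},
}
TRAILER = {"/Root": ref(6)}


def references(value):
    """Yield every object number referenced anywhere inside a value."""
    if isinstance(value, tuple) and len(value) == 3 and value[0] == "R":
        yield value[1]
    elif isinstance(value, dict):
        for item in value.values():
            yield from references(item)
    elif isinstance(value, list):
        for item in value:
            yield from references(item)


def reachable(start):
    """Object numbers reachable from 'start', in the order first visited."""
    seen, stack = [], [start]
    while stack:
        num = stack.pop()
        if num in seen:
            continue  # the /Parent back-pointer makes the graph cyclic
        seen.append(num)
        stack.extend(references(OBJECTS[num]))
    return seen


root = TRAILER["/Root"][1]
print(reachable(root))
```

The check for already-visited objects matters because each Page points back to its Pages node, so the graph contains a cycle.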

P.12

Before we get bogged down in more details of what is in a PDF file, let's look at another simple example (02).

This is the same "Hello World" page but if you look closely you will see that the words are only 50% gray, not completely black. You might want to zoom in to take a closer look.

P.13

Hello World - 50% Gray

2 0 obj

<<

  /Length 51

>>

 stream

BT

  /F1 24 Tf

  1 0 0 1 260 254 Tm

  0.5 g

  (Hello World)Tj

ET

 endstream

endobj

Again, no surprises. There is one new thing in this file that wasn't in the first: "0.5 g" which stands for 50% gray.

The default color is set to black (gray level 0) at the start of each page. As long as black is what is wanted, no further color settings are needed, as was the case in the first example.

Note an important point about PDF. Settings such as color and font are reset to the defaults at the start of each page. Settings on one page have no effect on settings on other pages. This gives us page independence, something that PostScript doesn't do.

p.14

Postfix Notation

operand-1 operand-2 operand-3 operator

3 4 add --> 7

3 4 add 3 mult --> 21

scale-x skew-x skew-y scale-y move-x move-y Tm

1 0 0 1 260 254 Tm

gray-level g

0.5 g

string Tj

(Hello World) Tj

font-internal-name size Tf

/F1 24 Tf

Well, we will be looking at more and more of the constructs found in the PDF files, so let us stop for a minute and talk about the notation that is being used.

Both PostScript and PDF use a notation known as "postfix". (That is where the name PostScript came from.) Postfix is a traditional mathematical notation sometimes called "reverse Polish" notation. The unique thing about this notation is that the "action" indicator, more often called the "operator", comes at the right hand end of the expression and complex expressions can be written without the use of parentheses. There is also a notation known as "prefix" notation where the operator comes first. The normal notation we usually use (e.g., a + b) is called "infix" notation because the operator is between or within the expression.

Postfix notation is sometimes used by hand calculators.

We have already seen the "0.5 g" expression where the number 0.5 is the first and only operand and the operator is "g". We have also seen the text string, which is indicated by enclosing the characters in parentheses and which becomes the single operand to the operator "Tj". The "T" in this operator must be capitalized and it stands for "text". PostScript spells its operators out: instead of "0.5 g" it uses "0.5 setgray", and instead of "Tj" it uses "show". So the example in PostScript would look like "(Hello World) show". We have also seen the object reference operator "R" which takes the object number and version number as its two preceding operands. I have come to look at "5 0 R" as one single thing -- a pointer off to object 5.
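Postfix notation is easy to evaluate with a stack: operands are pushed, and each operator pops what it needs and pushes its result. The following Python sketch evaluates the two arithmetic examples from the slide. The function name `eval_postfix` is made up, and the operator spellings "add" and "mult" are simply the ones the slide uses.

```python
def eval_postfix(expression):
    """Evaluate a whitespace-separated postfix expression; return the final stack."""
    operators = {
        "add": lambda a, b: a + b,
        "mult": lambda a, b: a * b,
    }
    stack = []
    for token in expression.split():
        if token in operators:
            b = stack.pop()          # the right-hand operand is on top
            a = stack.pop()
            stack.append(operators[token](a, b))
        else:
            stack.append(float(token))
    return stack


print(eval_postfix("3 4 add"))
print(eval_postfix("3 4 add 3 mult"))
```

No parentheses are ever needed: the order of the tokens alone decides which operands belong to which operator.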

p.15

Hello World - 50% Gray

2 0 obj

<<

  /Length 51

>>

 stream

BT

  /F1 24 Tf

  1 0 0 1 260 254 Tm

  0.5 g

  (Hello World)Tj

ET

 endstream

endobj

Continuing to work our way up from the "Hello World" string, we find the operator "Tm". This stands for "text matrix" and takes the six preceding numbers as its arguments. Those six numbers allow us to indicate scaling, skewing, rotating and moving of the material on the page or drawing surface. Note that the last two numbers are 260 and 254. This indicates where on the page we want the Hello World string to occur. The first four numbers (1 0 0 1) indicate that no rotation, scaling or skewing is to occur.
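As a rough illustration of what the six numbers do, the Python sketch below applies a matrix written in the "a b c d e f" order of the Tm operands to a point. The function name `apply_matrix` is hypothetical; the arithmetic is the usual PDF and PostScript convention, x' = a*x + c*y + e and y' = b*x + d*y + f.

```python
def apply_matrix(matrix, x, y):
    """Transform the point (x, y) by a six-number matrix (a, b, c, d, e, f)."""
    a, b, c, d, e, f = matrix
    return (a * x + c * y + e, b * x + d * y + f)


# The matrix from the example: no scaling, skewing or rotation, only a move.
hello_world = (1, 0, 0, 1, 260, 254)
print(apply_matrix(hello_world, 0, 0))

# A made-up variation that also doubles the size in both directions.
doubled = (2, 0, 0, 2, 260, 254)
print(apply_matrix(doubled, 10, 5))
```

With "1 0 0 1" in the first four positions, the origin of the text simply lands at (260, 254), which is why those last two numbers read as a position on the page.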

p.16

Make a mental note of where on the page the Hello World string is placed. In particular think of the point just to the left of the "H" of Hello and at the character's baseline. Think of this position in relation to the bottom left corner of the drawing surface.

P.17

The default measurement system in both PostScript and PDF is in units of 1/72 of an inch and the starting point is in the lower left corner of the page or drawing area.

So for an 8.5 by 11 inch page there are 612 units horizontally and 792 units vertically. Think of graph paper.

We are using pages that are 8.5 by 6.19 inches.

This means that the two numbers we saw in the "Tm" operation indicate the horizontal (X) offset from the left edge of the page (260) and the vertical (Y) offset from the bottom (254). So our string is to be placed 260/612ths in from the left and 254/446ths up from the bottom.

The number 446 is what you get by multiplying 6.19 inches by 72 units to the inch and rounding the result.
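The arithmetic on this slide can be checked in a few lines of Python. The names `UNITS_PER_INCH` and `inches_to_units` are just for illustration.

```python
UNITS_PER_INCH = 72  # default unit: 1/72 of an inch


def inches_to_units(inches):
    """Convert inches to default user-space units, rounded to a whole number."""
    return round(inches * UNITS_PER_INCH)


letter_width = inches_to_units(8.5)     # US letter width
letter_height = inches_to_units(11)     # US letter height
slide_height = inches_to_units(6.19)    # the page height used in these examples

# Position of the string as a fraction of the page width and height.
x_fraction = 260 / letter_width
y_fraction = 254 / slide_height
print(letter_width, letter_height, slide_height)
print(round(x_fraction, 3), round(y_fraction, 3))
```

Both fractions come out a little under and a little over one half, which matches the text sitting roughly in the middle of the page.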

P.18

Hello World - 50% Gray

2 0 obj

<<

  /Length 51

>>

 stream

BT

  /F1 24 Tf

  1 0 0 1 260 254 Tm

  0.5 g

  (Hello World)Tj

ET

 endstream

endobj

Again, working up from "Hello World" we encounter the "Tf" operator which appears to have two preceding arguments "/F1" an arbitrary name and "24" a number. "Tf" stands for "text font" and so it is safe to assume that the 24 means the point size. But what about the "/F1" which we said was an arbitrary "name"?

P.19

Basic PDF Building Blocks

* Strings: (Hello World) -- enclosed in ( )

* Numbers: 12 4.55

* Names: /Dog

* Arrays: [ /Dog 12 (a word) ] -- enclosed in [ ]

* Dictionaries: << /Size 12 /Color /red >> -- enclosed in << >>

* References (pointers): 1 0 R, 5 2 R, 129 0 R

Well, let's go back to some more basics. This time we look at the building blocks of all PDF files. Most of these constructs are also used in PostScript. You will learn quite a lot about PostScript here if you are not already familiar with it.

We mentioned strings already -- contained within parentheses. There are several more rules for forming strings, like what you do when you need to show parentheses. We leave those details to be looked up in the PDF Reference Manual.

Numbers are strings of digits. In PDF and PostScript they can have decimal points. In PostScript one can also use scientific notation like 1.5E-06, but that is not allowed in PDF.

Names or labels always begin with a "/" (slash) character. They are used as "keys" in dictionaries and in various ways to denote or label concepts and things.

Arrays are ordered, numbered, lists of things and always begin with a "[" (left square bracket) and end with a matching "]" (right square bracket). Between the brackets are the array elements separated by white space. The elements can be of any type, mixed or not. That is, PDF and PostScript support "heterogeneous" arrays. The elements are implicitly numbered from left to right starting with zero.

One of the most used and most powerful constructs used in PDF (and also in PostScript) is the "dictionary". These begin with "<<" (two less thans) and end with a matching ">>" (two greater thans). In between them are pairs of items, the first of each pair being a name. The second element of each pair can be any single construct. The first of each pair is the "key" and the second is the "value". So dictionaries provide name/value pairs, associative stores, or whatever other term you might have heard of where you can save and look up values by a key or name.

We already mentioned the "references" which use the postfix operator "R" and have the object number and version as the preceding arguments.

P.20

Nesting

[

<<

     /Name (Jim)

     /Age 39

     /Children [(Heather) (Timothy) (Rebecca)]

>>

     22

     44.55

]

<<

     /MORE [ 22 33 44 55 1 ]

     /LESS [ (dog) (cat) (mule) ]

     /count 88

>>

The basic elements shown on the preceding page can be put together to form more complicated structures by nesting elements inside one another.

The first example is an array that contains three items. Item zero is a dictionary, item 1 is 22 and item 2 is 44.55. The dictionary itself contains three entries. The first is the Name which has a string (Jim) as a value, Age which has the number 39 as a value and Children which has an array of the three strings (Heather) (Timothy) and (Rebecca) the names of my three children. And I'm 39 too!

The next example is a dictionary containing three entries, MORE which has as value an array of 5 numbers, LESS which has as value an array of three strings, and count which has a value of 88.

Note that upper and lower case letters are always significant in PDF and PostScript. So "More" is different from "MORE" and "count" is not the same as "Count".

P.21

Objects

3 0 obj

(a string object)

endobj

Direct versus Indirect Objects

<< /dog (a labrador) >>

can also be

<< /dog 4 0 R >>

...

4 0 obj

(a labrador)

endobj

We have already talked some about PDF objects. This is a construct that the PostScript Language does NOT have. The ability to randomly access portions of a PDF document and the page independence of PDF pages are based on the object structuring of PDF.

Objects always begin with the "obj" operator, which takes two preceding numbers, and end with the "endobj" (end object) operator. In between can be any one of the previously noted elements: a number, a string, a name, a reference, a dictionary or an array. On the next page we will see another new PDF construct, the "stream", that can also be within an object.

Notice that once objects and references to objects have been introduced, in many cases, there are alternative ways in which to represent more complex structures either using nesting with what are called "direct" objects or remotely via "indirect" objects. This is shown where the dictionary can contain the string (a labrador) nested within it or alternatively could nest only the reference to the string "4 0 R" and the string becomes an independent and indirect object somewhere else in the file.

The power of indirect objects is that they can be shared. That is, more than one reference can refer to any given object. If the object is large this can be a great space saver over repeating the object within each containing (referencing) object.
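Here is a minimal Python sketch of the direct versus indirect idea, using the labrador example from the slide. The tuple `("R", 4, 0)` standing in for "4 0 R", the `objects` table and the `resolve` helper are all assumptions made for the illustration.

```python
def resolve(value, objects):
    """Follow references of the form ("R", number, generation) to the real value."""
    while isinstance(value, tuple) and value[:1] == ("R",):
        value = objects[(value[1], value[2])]
    return value


# Indirect objects defined elsewhere in the file: (number, generation) -> value.
objects = {(4, 0): "a labrador"}

direct = {"/dog": "a labrador"}        # << /dog (a labrador) >>
indirect = {"/dog": ("R", 4, 0)}       # << /dog 4 0 R >>
also_shared = {"/pet": ("R", 4, 0)}    # a second dictionary sharing object 4

print(resolve(direct["/dog"], objects))
print(resolve(indirect["/dog"], objects))
print(resolve(also_shared["/pet"], objects))
```

All three lookups give the same string, but the indirect form stores it only once no matter how many dictionaries refer to it.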

P.22

Stream Objects

55 1 obj

<<

    /Length 31

    /Type /Content

    /special (true)

>>

stream

   this is the stream’s content

endstream

endobj

The last construct we will talk about is the "stream". Streams are objects containing a dictionary and a stream of material contained between the delimiters "stream" and "endstream". The dictionary must contain the key "/Length" followed by the stream length in bytes.
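Since /Length must match the number of bytes in the stream, it is safest to let a program count them. The Python sketch below wraps some content in a stream object and fills in /Length itself. The function name `make_stream_object` and the exact line breaks are my own choices for illustration; the PDF Reference has the precise rules about the end-of-line markers around the stream data.

```python
def make_stream_object(number, generation, data):
    """Wrap raw bytes in 'N G obj << /Length n >> stream ... endstream endobj'."""
    header = (
        f"{number} {generation} obj\n"
        f"<< /Length {len(data)} >>\n"
        "stream\n"
    ).encode("ascii")
    return header + data + b"\nendstream\nendobj\n"


content = b"BT /F1 24 Tf 1 0 0 1 260 254 Tm (Hello World)Tj ET"
stream_object = make_stream_object(2, 0, content)
print(stream_object.decode("ascii"))
```

If you edit a stream by hand in a text editor, this is the number you need to keep in step with your changes.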

Streams are used in many ways in PDF documents. One important use is to hold sampled image data. Images sampled at very high resolutions can be extremely large (e.g., 50 megabytes). The bytes don't fit the model for a simple string, array, etc. The stream is the place where potentially large and perhaps unstructured material or material requiring a structure different from the basic object structure can be placed.

This construct allows implementations to have two distinct methods for obtaining material out of the file. The objects are usually relatively small and can be read in their entirety. Streams are the place to put data that doesn't fit this model.

P.23

We now return to our 50% gray "Hello World" example after having learned more about the details of the PDF notation.

We had been working our way up from the Hello World string and had arrived at the "Tf" operator which takes two preceding operands. The 24 is the font point size. The /F1 is an arbitrary made-up name for the font. Having the full long font names within the page content repeated every time there is a font change would be pretty bulky. So the design allows for a made-up short font name (e.g., /F1) to be used to refer off to a "font resource". All resources are referenced via a "resource dictionary" that is associated with each page.

The page content object number 2 is pointed to by the /Page object number 1 by means of the /Contents key. You will notice that the /Page object also has a /Resources key which, in this case, references object 3. If we then look at object 3 we see that it has a key /Font which has as its value a dictionary with the made-up name /F1 which then refers to object 4. Looking at object 4 we see that it is an object of /Type /Font with a /Subtype of /Type1 (the Adobe PostScript font format) and it also has a /BaseFont key indicating that the font is the base font named /Helvetica. Base Fonts are the base 13 fonts of PostScript and must be built in to any PDF processing program. In this way the font definition needs no further information to define the font being used. In the case of fonts that are not built in, this font object is more complex and points off to character width tables, encoding vectors, and the actual font outlines.

Other types supported besides /Type1 include /TrueType and /Type3 and subsets of them.

Let us finish off this page content object completely. We have discussed the operators "Tj", "g", "Tm" and "Tf". That leaves the "BT" or begin text and the "ET" or end text operator. This is another place where PDF differs from PostScript. In PDF the BT and ET form a text block which is not allowed to contain graphics or image operators. It was felt that being able to find and isolate text blocks would help in reading and updating PDF files. PostScript does not have this segregation of text -- all imaging operators are at the same level.

That completes the description of all of the content operators within this page contents stream. Streams were discussed earlier. The contents of each page are held within a stream, as in this object 2.

P.24

Onward to example 3! We see that our "Hello World" string has become red.

P.25

Hello World - Device Dependent 100% Red

We see the "rg" operator highlighted.

One of the design criteria for PDF was to make the files as small as possible. So one letter abbreviations were used for operator names. When most of the one letter suggestive abbreviations were exhausted the designers turned to two letter abbreviations. So "rg" is the abbreviation for "rgb" or red, green, blue.

This is a device dependent color request asking for 100% of the red colorant to be used and zero percent of the green and blue. The three arguments preceding the "rg" operator indicate the percentage of red, green and blue, respectively. Those numbers range from 0 (none) to 1 (100%) and can contain decimal points (e.g., 0.45 would indicate the use of 45% of that colorant).
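If you are used to colors given as 0 to 255 per channel, as on the Web, the small Python sketch below converts "rg" operands to that form. The function name `rg_to_8bit` is made up, and the 8-bit scale is only a familiar point of comparison; PDF itself works with the 0 to 1 values.

```python
def rg_to_8bit(red, green, blue):
    """Convert rg operands in the range 0..1 to 0..255 integers."""
    for value in (red, green, blue):
        if not 0 <= value <= 1:
            raise ValueError("rg operands must be between 0 and 1")
    return tuple(round(value * 255) for value in (red, green, blue))


print(rg_to_8bit(1, 0, 0))       # the operands of "1 0 0 rg"
print(rg_to_8bit(0.45, 0, 0))    # 45% of the red colorant
```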

P.26

Another red "Hello World"?

P.27

Hello World -- L*a*b* Red

This time we show how device independent color designations can be used in PDF. There are two operators used in this example: "cs" which is the abbreviation for "set the color space" and "sc" which is the abbreviation for "set the color within the 'current' color space".

Looking first at the "cs" operator we see that it takes one preceding argument which is a made-up name /CS1. This was a name of my choosing. It could just as easily have been /mycolorspace or /X or whatever.

Color spaces are considered a resource just like fonts so they are also looked up in the resources dictionary -- in this example, object 3. We see that this dictionary contains an entry for /ColorSpace which has a dictionary as its value. Within that dictionary we find the key /CS1 which has as its value the color space specification for our color space as an array. A more readable form of the color space definition is given on the next page.

The first element of the array is the name /Lab which defines this color space to be the CIE L*a*b* color space. The next element of the array is a dictionary that defines two keys, /Range and /WhitePoint. The /Range key has as its value an array of four numbers giving the range of a* and b* values, respectively. The value of L* is always assumed to be between 0 and 100.

The /WhitePoint key has as its value an array of 3 values providing the CIE XYZ values, respectively, of the diffuse white point of the color space.

The "ri" operator is next after the "cs" operator (back to the page content object 2) and it stands for "rendering intent". Those of you seriously into the color management topic will recognize this as determining the rendering intent of "AbsoluteColorimetric" which is one of 4 choices for how to do gamut compression and other adjustments. The single argument to "ri" is the name of a rendering intent.

Next the "sc" or set color operator has as its three preceding operands, the value of L*, a* and b*, respectively. Trust me, 63, 127, 127 is a red color in L*a*b*.

P.28

Color spaces are considered a resource just like fonts so they are also looked up in the resources dictionary -- in this example object 3. We see that this dictionary contains an entry for /ColorSpace which has a dictionary as its value. Within that dictionary we find the key /CS1 which has as its value the color space specification for our color space as an array.

The /WhitePoint key has as its value an array of 3 values providing the CIE XYZ values, respectively, of the diffuse white point of the color space.

P.29

A third red "Hello World"?

P.30

This time we also specify a red color using a device independent color space -- this one is a "calibrated RGB" color space.

Again, we have a made-up name for the color space /CS2 and it can be looked up in the resources dictionary -- object 3.

Since this color space is an rgb one, the "sc" set color operator has as its arguments three numbers for red, green, and blue, respectively.

In this case, red -- 100%, green -- 0%, and blue -- 0%.

To get a better formatted view of the /CS2 color space go to the next page.

Note that this example does not have any "ri" or rendering intent specified. In this case it will use the "default" setting which is for "RelativeColorimetric" processing.

P.31

This is a "calibrated RGB" color space indicated by the name /CalRGB being the first element of the /CS2 array. The second element is a dictionary containing the properties that would distinguish this particular RGB color space from all other RGB color spaces. The properties enumerated here are the gammas for each of red, green and blue, respectively (all 2.222 in this case) and the diffuse white point of the color space given as a CIE XYZ value. In addition, this definition has a /Matrix key whose value is an array of nine elements. These elements are three groups of three numbers and specify the transformation of an rgb value in this space, after gamma correction to the CIE XYZ standard. The first three numbers are for red, the next three for green and the last three for blue.
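The two steps described here, gamma correction and then the nine-number matrix, can be sketched in Python as follows. The gamma of 2.222 comes from the example, but the matrix values below are placeholders I picked for illustration and are NOT the ones in the example file; the names `GAMMA`, `MATRIX` and `cal_rgb_to_xyz` are likewise made up.

```python
GAMMA = (2.222, 2.222, 2.222)  # red, green, blue gammas, as in the example

# Placeholder /Matrix values (assumed, not taken from the example file):
# three numbers for red, then three for green, then three for blue.
MATRIX = (0.4124, 0.2126, 0.0193,
          0.3576, 0.7152, 0.1192,
          0.1805, 0.0722, 0.9505)


def cal_rgb_to_xyz(rgb, gamma=GAMMA, matrix=MATRIX):
    """Gamma-correct each component, then combine the three matrix triples."""
    a, b, c = (value ** g for value, g in zip(rgb, gamma))
    red, green, blue = matrix[0:3], matrix[3:6], matrix[6:9]
    return tuple(a * red[i] + b * green[i] + c * blue[i] for i in range(3))


print(cal_rgb_to_xyz((1, 0, 0)))       # pure red picks out the first triple
print(cal_rgb_to_xyz((0.5, 0.5, 0.5)))
```

With red at 100% and the others at zero, the result is simply the first three matrix numbers, which is one way to see why that triple is said to be "for red".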

P.32

Finally, no more red "Hello World"! This time we have added a blue star to the black "Hello World" page of example 01. You might want to zoom in to take a closer look at the star.

P.33

Blue Star

2 0 obj

<< /Length 51 >>

 stream

BT

  /F1 24 Tf

  1 0 0 1 260 254 Tm

  (Hello World)Tj

ET

  0 0 1 rg

  315 226 m

  299 182 l

  339 208 l

  291 208 l

  331 182 l

  f

 endstream

endobj

The blue star is "drawn" using line graphics operators "l", "m" and "f". This is a very compact and expressive way to create lots of colored objects quickly. Graphics expressed this way also rotate, scale and move with no degradation and are completely resolution independent. This is a feature of PDF and PostScript and the degree to which it is precise and expressive is rather unique to these Adobe languages.

We see in the page content object 2 the familiar "Hello World" string. After the ET operator that ends the Hello World text block we have some new lines of material highlighted. These lines represent the blue star graphic.

Right off we can see why the star is blue since the "rg" operator has arguments specifying 0% red, 0% green and 100% blue.

To look further into how this "graphic" star is drawn we elaborate on the next page.

P.34

  0 0 1 rg   % 0 0 1 setrgbcolor

  315 226 m  % 315 226 moveto

  299 182 l  % 299 182 lineto

  339 208 l  % 339 208 lineto

  291 208 l  % 291 208 lineto

  331 182 l  % 331 182 lineto

  f          % fill

We already discussed the "graph paper" coordinate system that is the default one used in PostScript and PDF. The origin or zero, zero point is the lower left corner of the drawing area (page) and the units of measure by default are 1/72 of an inch. Both of these things can be changed as we will see in a subsequent example.

In the lower left corner of this page, the graphics operators that draw the blue star are reproduced from example 6. After the operator "rg" that causes the star to be device dependent blue, we see the "m" or "moveto" operator. In PostScript the operator actually is "moveto" but in PDF we try to conserve space so we just use "m".

The two arguments to the "m" operator are the horizontal (X) and vertical (Y) coordinates of the new "current point", measured from the origin (0, 0) at the lower left corner of the page.

This moves us to the starting point on the page for drawing our star. Next are four uses of the operator "l" which is the abbreviation for "lineto", meaning imagine a line from the current point to the new point and move the current point as well. These 4 "l" operators then sketch out a path of four lines giving shape to the star. Notice two things: first, so far we have not really drawn anything on the page -- just sketched out an imaginary shape, and second, we didn't draw the last or fifth line needed to completely specify the star shape.

The last operator in the graphics group is the "f" operator. That stands for fill. Since the starting point of our imaginary path is not the same as the ending point, the "f" operator adds the fifth imaginary line itself and then it fills in the shape with the current (blue) color.
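To see the "fifth imaginary line" explicitly, here is a Python sketch that reads the star's path operators and lists the segments of the closed shape. The helper names `parse_path` and `closed_segments` are made up, and the parser handles only "m", "l" and "f", which is all this path uses (the "rg" color line is left out).

```python
STAR = "315 226 m 299 182 l 339 208 l 291 208 l 331 182 l f"


def parse_path(content):
    """Return the points visited by the m and l operators, in order."""
    points, operands = [], []
    for token in content.split():
        if token in ("m", "l"):
            x, y = operands          # each operator takes two operands
            points.append((x, y))
            operands = []
        elif token == "f":
            break                    # fill ends the path description
        else:
            operands.append(float(token))
    return points


def closed_segments(points):
    """Pair each point with the next, wrapping around to close the shape."""
    return list(zip(points, points[1:] + points[:1]))


points = parse_path(STAR)
for start, end in closed_segments(points):
    print(start, "->", end)
```

Five points give five segments; the last one, from (331, 182) back to (315, 226), is the line that "f" supplies on its own.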

P.35

This page and the next show the complete Blue Star example.

P.36

All of the examples shown in this presentation are complete in that all objects are shown. In this case we have already seen the role that these objects play in example 6 so we just move on to the next example.

P.37

OK. Now we have the same star but it is only outlined in cyan instead of being filled with blue.

Just to add a little more complexity to the page we returned to one of the red "Hello World" versions. In fact, this example is based on example 4 which is a device independent red using the L*a*b* color space.

Again, you might want to zoom in for a closer look at the "stroked" not filled star. (Is that something like James Bond's stirred not shaken?)

P.60

Hello World, Star, Logo -- Page Contents Compressed

2 0 obj

<<

/Filter /FlateDecode

/Length 452

>>

stream

ÄäÄ—ybC060D„!¥B+*`Ñ2úDsFÖ1ƒd7í…ƒ34PH2õÜÒ\fir6%CP4ãå")T£6S#Ä⁄î@S!ÉFc®Äj7§

DÜCëÕXd8*C:Ë‘k±V‡ˆkEéà_¥ÉLPÍFÌAãUF’Åé4B$¡‚ò÷H¬†Ñ"Å@8£à‰C)

ÿ“c2ëOEF»4íD#tZ=&âpˇ‘ju+ç.å]¹uj¶N¹~àZ¢ŸnT[m‚5©pŸ57ªTßrî·Îó¸míˇì¹

xÛ5O-~ëc“Ÿ,zò&flcT€Ì¶°€$flUç„’1òØé7³Ä¯ÁÈ¯fl.³Ü‚„m6<p˘ëØ£tfiåÔ{F˜7.

lJAì‘—Økê⁄pn3 —<N3ú⁄ò.—ªÆ®àïNaT“ƒÆ4Nà@¹ÃF“Ö¥K]æÆSô

¥è;fl ¥¥cç¥ëãdf¶±ke¥—ªk7/³D¸DF4•)¥£Ér˛ÖCQ…0³E#;–°E!5N}--

Ò·7'¡)=€D‚æo‹—$7OE4…-KÑlR‡'Ã

endstream

endobj

The unprintable characters making up most of this page are compressed data for the page contents object.

In example 12 the page contents is a stream containing the "Hello World" string and printing controls for the star and the Adobe logo. All of these things are still in the page content stream here but now compressed using the Flate compression technique. This is the technology used to make PNG files. You will note that the stream dictionary now notes that a /Filter is needed to read the stream and it is the /FlateDecode filter.

If you look back at example 12 you will note that the stream length is 1343 whereas this compressed representation of the stream is only 452 bytes long. That is roughly a 3 to 1 reduction in size. This is an important technique for keeping PDF file sizes small. Unfortunately for the curious it makes reading the files with a text processor like we have been doing not very helpful.
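You can try Flate compression yourself with Python's standard zlib module, which reads and writes the same kind of compressed data that /FlateDecode expects. The sketch below uses a made-up, repetitive stand-in for the page contents (not the real example 12 stream), so the size reduction it prints will not match the 1343 to 452 figures above.

```python
import zlib


def flate_encode(data):
    """Compress stream bytes the way a /FlateDecode stream stores them."""
    return zlib.compress(data)


def flate_decode(data):
    """Recover the original stream bytes."""
    return zlib.decompress(data)


# Made-up stand-in for a page contents stream.
content = b"BT /F1 24 Tf 1 0 0 1 260 254 Tm (Hello World)Tj ET\n" * 20

compressed = flate_encode(content)
restored = flate_decode(compressed)
print(len(content), "bytes before,", len(compressed), "bytes after")
```

Remember that the /Length in the stream dictionary counts the compressed bytes, not the original ones.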

P.65

Hello World, Star, Logo -- With Cross Reference

xref

0 8

0000000000 65535 f

0000000016 00000 n

0000000102 00000 n

0000000626 00000 n

0000000947 00000 n

0000001033 00000 n

0000001124 00000 n

0000001177 00000 n

trailer

<</Size 8

  /Info 7 0 R

  /Root 6 0 R

  /ID[

  <516b0039a4e03b90b

    d0a72f349225b02>

  <516b0039a4e03b90b

    d0a72f349225b02>]

>>

startxref

%%EOF

This is a cross reference table for this example file.

One design requirement for PDF was to allow a viewer to read, say, page 501 in a document of, say, 1000 pages without reading all the file first and without reading pages 1-500 first either. In order to do this there must be some index or cross reference material someplace that indicates which bytes in the file are to be associated with which object. This is the function of the cross reference table shown.

We have already seen a little of this material, namely the "trailer <</Root 6 0 R>>" section. The viewer just does not know where to start without this Root object number. However, given that the objects all have "obj" and "endobj" around them the viewers can read through the whole file and build a cross reference if need be. This is what the file repair amounts to.

The cross reference has a dummy entry for object zero at the start and then seven more, one for each of the objects in the file. So for example the second line is "0000000016 00000 n" which indicates that object one (the position of this line in the list) starts at byte offset 16 in the file. The "00000" is the generation number, being zero for all of the objects in this file. The "n" indicates that this is a valid entry in the table. Note that the first entry has an "f" indicating that it is a dummy non-usable entry.

Other things to note: the Information dictionary (object 7) is referred to by the trailer dictionary. The trailer dictionary also has an /ID key that has an array of two values as its value. These strings represent a unique identification of this file so it can be identified no matter what the file name is changed to. This is important for applications like document databases and document management systems.

The "startxref" has a number after it indicating where in the file the cross reference table can be found and the "%%EOF" indicates the end of the file.
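The layout of the table is simple enough to generate by program. This Python sketch formats a cross reference section from a list of byte offsets, using the seven offsets shown on the slide. The function name `xref_section` is made up, and this is a simplification: a real PDF writer must also make every entry exactly 20 bytes long including its line ending, which this sketch does not attempt.

```python
def xref_section(offsets):
    """Format an xref section; offsets[i] is the byte offset of object i + 1."""
    lines = [
        "xref",
        f"0 {len(offsets) + 1}",     # first object number, number of entries
        "0000000000 65535 f",        # dummy entry for object zero
    ]
    for offset in offsets:
        # 10-digit offset, 5-digit generation number, "n" for an in-use entry
        lines.append(f"{offset:010d} {0:05d} n")
    return "\n".join(lines) + "\n"


slide_offsets = [16, 102, 626, 947, 1033, 1124, 1177]
table = xref_section(slide_offsets)
print(table)
```

Given a table like this, a viewer can jump straight to any object by number without reading the objects that come before it.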

There is more to all this and the real student will find the details in the PDF Reference Manual.
