Tracing through a page-break style-inheritance problem with Office 2007 SP2 ODF

ODF as a Get-Out-Of-Jail-Free Card?

By Rick Jelliffe
May 26, 2009 | Comments: 9

I finally downloaded Office 2007 SP2, the upgrade to Office 2007 that gives first-class ODF support, and decided to try a simple experiment. When I load the ODF 1.1 standard (the .ODT version from OASIS), what does it look like?

I don't want to give oxygen to fires lit by agitating spin doctors, but it is appropriate to check SP2's implementation. So I'll preface this by imagining four classes of fidelity that a document from one word processor might have in another, since not all errors are equally important:

  1. Draft. All significant text is present, in the right order, with no omissions or additions. Autonumbering and internal cross-references should work.
  2. Rich text. All text is present, with rich text and graphics and tables and headings. Like HTML. Relative style relationships should roughly hold. (Old-timers can think of galley proofs.)
  3. Publishing quality. Styles should completely follow the stylesheet. Page-formatting features and auto-generated text should work. Arcane typesetting features should work. The page count should be within +/- 10% of the original. Features for which there is no direct equivalent should be simulated as best as possible.
  4. Facsimile. The document opens with, to all intents and purposes, the same formatting with the line and page breaking and page count.

When looking at the ODF loading, it looked like the Draft and Rich Text levels of fidelity had been met. I could not find any examples of missing text, the numbering appeared correct, the basic rich text seemed right, headings were appropriately marked, tables and grey backgrounds looked OK, and so on.

There is of course no chance of a Facsimile level of fidelity, but I do not think it unreasonable to expect a Publishing quality level. SP2 does not quite deliver it.

Just the briefest scan showed two errors of this kind (I commend this as a good test case if anyone else cares to follow up with other issues):

  • Inherited page breaks wrong
  • No line numbering on quoted schemas

I decided to trace through the page break issue, since this is one that would definitely come up in conversion jobs.

What?

I first opened the ODF 1.1 document in OpenOffice 2.4 and 3.1. Both opened with 720 pages.

I then opened the same document in Office 2007 with SP2. It registered 1,982 pages. WTF?

Looking at it, it was clear that every heading was being preceded by a page break.

So time to look inside the ODF file.
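An ODT file is a ZIP package, so looking inside it needs nothing more than a ZIP reader. The following is a minimal sketch in Python rather than the tool used for this article. The helper names are hypothetical, and the demo package is a stand-in with placeholder contents, not the real ODF 1.1 document.

```python
import io
import zipfile


def list_package_parts(odt_bytes):
    """Return the names of the parts stored in an ODF package."""
    with zipfile.ZipFile(io.BytesIO(odt_bytes)) as package:
        return package.namelist()


def read_part(odt_bytes, name):
    """Return one part of the package (e.g. content.xml) as text."""
    with zipfile.ZipFile(io.BytesIO(odt_bytes)) as package:
        return package.read(name).decode("utf-8")


def make_demo_package():
    """Build a tiny stand-in package in memory (placeholder contents only)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("mimetype", "application/vnd.oasis.opendocument.text")
        package.writestr("content.xml", "placeholder for the document body")
        package.writestr("styles.xml", "placeholder for the named styles")
    return buffer.getvalue()


demo = make_demo_package()
print(list_package_parts(demo))
print(read_part(demo, "mimetype"))
```

For a real document, read the file's bytes from disk and pass them in; content.xml and styles.xml are the two parts examined in the rest of this article.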


Why?


Looking at the first occurrences, I decided to pick the page break at the 1.1 Notation heading.

Tracing through the styles in Office, it is a Heading 2, which is based on Heading 1. And Heading 1 has the property Page break before set.

Now I look at the ODT file. I open it up as a ZIP archive, and see that the markup in content.xml is:


<text:p text:style-name="Text_20_body">
The
<text:user-field-get text:name="CommitteeName">OpenDocument</text:user-field-get>
format makes use of a package concept. These packages are described in chapter
<text:reference-ref text:reference-format="chapter" text:ref-name="Package">17</text:reference-ref>
.
</text:p>
<text:list text:style-name="Outline" text:continue-numbering="true">
<text:list-item>
<text:list text:continue-numbering="true">
<text:list-item>
<text:h text:style-name="P5" text:outline-level="2">Notation</text:h>
</text:list-item>
</text:list>
</text:list-item>
</text:list>

First of all, I quickly rule out that the break is caused by the preceding paragraph: the style is just a vanilla paragraph.

We find style P5 defined in the same content.xml:

<style:style style:name="P5" style:family="paragraph" style:parent-style-name="Heading_20_2">
  <style:text-properties fo:language="en" fo:country="US" />
</style:style>

We look in styles.xml for those styles.

<style:style style:name="Heading_20_1"
     style:display-name="Heading 1" 
    style:family="paragraph" 
    style:parent-style-name="Standard" 
    style:next-style-name="Standard" 
    style:list-style-name="Outline" 
    style:class="text" 
    style:master-page-name="" 
    style:default-outline-level="1">
  <style:paragraph-properties 
        fo:margin-left="0cm" 
        fo:margin-right="0cm" 
        fo:margin-top="0.847cm" 
        fo:margin-bottom="0.212cm" 
        fo:text-indent="0cm" 
        style:auto-text-indent="false" 
        style:page-number="auto" 
        fo:break-before="page" 
        fo:padding-left="0cm" 
        fo:padding-right="0cm" 
        fo:padding-top="0.212cm" 
        fo:padding-bottom="0cm" 
        fo:border-left="none" 
        fo:border-right="none" 
        fo:border-top="0.002cm solid #000000" 
        fo:border-bottom="none" fo:keep-with-next="always" /> 
  <style:text-properties 
        fo:color="#000099" 
        fo:font-size="18pt" 
        fo:font-weight="bold" 
        style:letter-kerning="true" 
        style:font-size-asian="18pt" 
        style:font-weight-asian="bold" 
        style:font-name-complex="Arial" 
        style:font-size-complex="16pt" 
        style:font-weight-complex="bold" /> 
  </style:style>

<style:style style:name="Heading_20_2"
    style:display-name="Heading 2"
    style:family="paragraph"
    style:parent-style-name="Heading_20_1"
    style:next-style-name="Standard"
    style:list-style-name="Outline"
    style:class="text"
    style:default-outline-level="2">
  <style:paragraph-properties
        fo:margin-left="0cm"
        fo:margin-right="0cm"
        fo:margin-top="0.423cm"
        fo:margin-bottom="0.212cm"
        fo:text-indent="0cm"
        style:auto-text-indent="false"
        fo:break-before="auto"
        fo:break-after="auto"
        fo:padding="0cm"
        fo:border="none">
    <style:tab-stops>
      <style:tab-stop style:position="0.635cm" />
    </style:tab-stops>
  </style:paragraph-properties>
  <style:text-properties
        fo:font-size="14pt"
        style:font-size-asian="14pt"
        style:font-size-complex="14pt"
        style:font-style-complex="italic"
        style:font-weight-complex="normal" />
</style:style>

So the issue is that the grandparent style Heading_20_1 sets fo:break-before="page", and this is overridden by the parent style Heading_20_2, which sets fo:break-before="auto".
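The lookup just described can be sketched in Python. This is an illustration under stated assumptions, not how OpenOffice or Word actually resolves styles: it assumes a property is looked up by walking the style:parent-style-name chain until some style sets it explicitly, and it uses a cut-down copy of the three styles above. The function names and the auto_means_inherit switch are hypothetical; the switch models the suspected misreading.

```python
import xml.etree.ElementTree as ET

STYLE_NS = "urn:oasis:names:tc:opendocument:xmlns:style:1.0"
FO_NS = "urn:oasis:names:tc:opendocument:xmlns:xsl-fo-compatible:1.0"

# Cut-down copy of the styles quoted above: only the break properties are kept.
STYLES_XML = f"""
<styles xmlns:style="{STYLE_NS}" xmlns:fo="{FO_NS}">
  <style:style style:name="Heading_20_1" style:parent-style-name="Standard">
    <style:paragraph-properties fo:break-before="page"/>
  </style:style>
  <style:style style:name="Heading_20_2" style:parent-style-name="Heading_20_1">
    <style:paragraph-properties fo:break-before="auto" fo:break-after="auto"/>
  </style:style>
  <style:style style:name="P5" style:parent-style-name="Heading_20_2">
    <style:text-properties fo:language="en" fo:country="US"/>
  </style:style>
</styles>
""".strip()


def load_styles(xml_text):
    """Map each style name to (parent style name, explicit break-before or None)."""
    root = ET.fromstring(xml_text)
    styles = {}
    for style in root.iter(f"{{{STYLE_NS}}}style"):
        name = style.get(f"{{{STYLE_NS}}}name")
        parent = style.get(f"{{{STYLE_NS}}}parent-style-name")
        props = style.find(f"{{{STYLE_NS}}}paragraph-properties")
        value = props.get(f"{{{FO_NS}}}break-before") if props is not None else None
        styles[name] = (parent, value)
    return styles


def resolve_break_before(styles, name, auto_means_inherit=False):
    """Walk up the parent chain to the nearest style that sets break-before.

    With auto_means_inherit=True, "auto" is skipped as if it were unset,
    which models the misreading this article suspects.
    """
    while name in styles:
        parent, value = styles[name]
        if value is not None and not (auto_means_inherit and value == "auto"):
            return value
        name = parent
    return "auto"  # nothing set anywhere: fall back to the XSL initial value


styles = load_styles(STYLES_XML)
print(resolve_break_before(styles, "P5"))                           # auto
print(resolve_break_before(styles, "P5", auto_means_inherit=True))  # page
```

Under the first reading the Notation heading gets no forced break; under the second it picks up the page break from Heading_20_1, which matches what SP2 displays.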


Which is right?


So what does fo:break-before="auto" mean? In the ODF 1.1 spec:

15.5.22 Break Before and Break After

Use the fo:break-before and fo:break-after properties to insert a page or column break before or after a paragraph. See §7.19.1 and §7.19.2 of [XSL] for details. The values odd-page and even-page are not supported.

OK, so let's go look at XSL. The reference is

[XSL]W3C, Extensible Stylesheet Language (XSL), 
http://www.w3.org/TR/2001/REC-xsl-20011015/, W3C, 2001.

The relevant section of the XSL spec is 7.19.2:

7.19.2 "break-before"

XSL Definition:
Value: auto | column | page | even-page | odd-page | inherit
Initial: auto
Applies to: block-level formatting objects, fo:list-item, and fo:table-row.
Inherited: no
Percentages: N/A
Media: visual

Values have the following meanings:

auto
No break shall be forced.

NOTE:
Page breaks may occur as determined by the formatter's
processing as affected by the "widow", "orphan", "keep-with-next",
"keep-with-previous", and "keep-together" properties

That seems rather clear. There should not be a page break.


How come?

So the next step is to look at Microsoft's Implementer's Notes. These were something that I really welcomed, and I think they show a sign of Microsoft's increasing maturity: decades ago I was really impressed that engineering-cultured companies like Hewlett-Packard actually printed books of the bugs in their current UNIX offerings. The notes should be really helpful.

Navigating through the notes, we see that the note on s15.5.22 says


The standard defines the property "auto", contained within the attribute fo:break-before, contained within the element <style:paragraph-properties>. This property is supported in core Word 2007.

So according to the Implementer's Notes, auto should be supported. But what does Word think "auto" means? Let's look at the OOXML standard to see the equivalent. OOXML does not have a single equivalent; it just has the pageBreakBefore empty element (17.3.1.23).

What seems to have happened is that the implementer has assumed that "auto" means "inherit", when in fact it resets page breaking to its normal behaviour. It looks like a bug to me.
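The difference between the two readings is small enough to sketch. This is purely illustrative and is not Microsoft's converter: the function name is hypothetical, and it assumes that WordprocessingML lets an on/off element such as pageBreakBefore carry a w:val attribute to switch the property off explicitly, while leaving the element out means the value comes from the based-on style.

```python
from typing import Optional


def page_break_before_markup(odf_value: Optional[str],
                             auto_means_inherit: bool = False) -> str:
    """Map an ODF fo:break-before value to WordprocessingML paragraph markup.

    An empty string means "emit nothing", i.e. inherit from the based-on style.
    Only the values discussed in this article are handled.
    """
    if odf_value == "page":
        return "<w:pageBreakBefore/>"
    if odf_value == "auto" and not auto_means_inherit:
        # "auto" means no forced break, so the parent's setting must be
        # switched off explicitly rather than left to inheritance.
        return '<w:pageBreakBefore w:val="0"/>'
    if odf_value in (None, "auto"):
        # Attribute absent, or "auto" misread as "inherit": emit nothing.
        return ""
    raise ValueError(f"unhandled fo:break-before value: {odf_value!r}")


print(page_break_before_markup("auto"))
print(repr(page_break_before_markup("auto", auto_means_inherit=True)))
```

With the misreading, Heading 2 emits nothing and so picks up the page break from Heading 1, which is the behaviour observed in SP2.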

How could this mistake have occurred? It suggests that there is a deadline issue at Microsoft that is running directly counter to their need for quality in delivering standards support.

It would be highly ironic if the Implementer's Notes system has actually been their undoing. Normally it would be beyond credibility that no-one would have opened up the ODF 1.1 specification when implementing ODF 1.1 and therefore noticed the problem. But I wonder whether they sliced it into pieces, as HTML or whatever, as part of their implementation tracking system, and always referred to that? Speculation, but stranger things have happened. ODF 1.1 needs to be part of their regression tests.

(I didn't trace through the reason for the lack of line numbers on schema fragments. The Implementer's Notes for ODF mention that Office supports this feature, so it looks like a bug or incompleteness.)

How to fix

Actually the fix is trivial.

In the Home tab of the Ribbon, click on the little box at the bottom of the Styles chunk. This will open the styles list on the left side of the document.

Click your mouse in the offending heading at 1.1 Notation to move the cursor there. The Heading 2 style will be highlighted in the Styles list.

Right-click on Heading 2 and select Modify. A dialog box will come up showing the style's settings. From the Format button at the bottom, select Paragraph, then the Line and Page Breaks tab. Deselect Page Break Before and save your way out.

This will not only make Heading 2 correct, but fix all the other headings derived from it.

The page count now? A credible 681, only 39 different from the OpenOffice count.
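Checking those numbers against the +/- 10% page-count tolerance in the Publishing quality class above is simple arithmetic. A small sketch, with hypothetical function names, taking the OpenOffice count of 720 as the original:

```python
def relative_difference(original_pages, converted_pages):
    """Page-count difference as a fraction of the original page count."""
    return abs(converted_pages - original_pages) / original_pages


def within_tolerance(original_pages, converted_pages, tolerance=0.10):
    """True if the converted page count is within the given tolerance."""
    return relative_difference(original_pages, converted_pages) <= tolerance


print(f"{relative_difference(720, 1982):.1%}")  # 175.3% before the fix
print(f"{relative_difference(720, 681):.1%}")   # 5.4% after the fix
print(within_tolerance(720, 681))               # True
```

So the fixed document lands inside the tolerance on page count alone, although that is only one of the Publishing quality criteria.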


ODF as a Get-Out-Of-Jail-Free Card

While it takes a few steps, it looks like the standard is clear here. Obviously the SP2 behaviour is different from OpenOffice, and doesn't follow the ODF standard for what the markup says.

And now comes the ODF killer. Just when I thought everything was simple, the ODF 1.1 standard's shoddy (in patches) drafting and poor review kick in. I left out the second paragraph of ODF 1.1 s15.5.22 on break-before and break-after. It says:

These two properties are mutually exclusive. If they are used simultaneously, the result is undefined.

Now, I bet that this was supposed to mean that if the previous paragraph had a break-after and the current one has a break-before, then it is application-defined what happens. (This alone is enough to cause pagination problems that would fail my Publishing quality criteria above, even with conforming implementations. But it does reflect the reality that different systems have different resolution mechanisms that are sometimes difficult to override.) But that is not what it says.

And, sure enough, when we look again at style Heading_20_2 it does indeed have settings for both break-before and break-after. This is a get-out-of-gaol-free (jail) card for implementers, in this case Microsoft, but it can be someone else next time.
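Finding every style that trips this clause is easy to automate. The sketch below takes the literal reading, in which a property counts as "used" whenever the attribute is specified, whatever its value. It is an illustration only: the function name is hypothetical and the sample is a cut-down copy of the two heading styles quoted earlier.

```python
import xml.etree.ElementTree as ET

STYLE_NS = "urn:oasis:names:tc:opendocument:xmlns:style:1.0"
FO_NS = "urn:oasis:names:tc:opendocument:xmlns:xsl-fo-compatible:1.0"

# Cut-down copy of the two heading styles: only the break properties are kept.
SAMPLE_XML = f"""
<styles xmlns:style="{STYLE_NS}" xmlns:fo="{FO_NS}">
  <style:style style:name="Heading_20_1">
    <style:paragraph-properties fo:break-before="page"/>
  </style:style>
  <style:style style:name="Heading_20_2" style:parent-style-name="Heading_20_1">
    <style:paragraph-properties fo:break-before="auto" fo:break-after="auto"/>
  </style:style>
</styles>
""".strip()


def styles_setting_both_breaks(xml_text):
    """Return names of styles that specify both break-before and break-after."""
    root = ET.fromstring(xml_text)
    found = []
    for style in root.iter(f"{{{STYLE_NS}}}style"):
        props = style.find(f"{{{STYLE_NS}}}paragraph-properties")
        if props is None:
            continue
        has_before = props.get(f"{{{FO_NS}}}break-before") is not None
        has_after = props.get(f"{{{FO_NS}}}break-after") is not None
        if has_before and has_after:
            found.append(style.get(f"{{{STYLE_NS}}}name"))
    return found


print(styles_setting_both_breaks(SAMPLE_XML))  # ['Heading_20_2']
```

A stricter reading that ignores "auto" values would report nothing here, which is exactly why the wording of the clause matters.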

Standards are difficult. They require review and maintenance, not the blind pressing ahead with new features. A new major implementation of the standard often reveals unsatisfactory parts of the standard. I expect ODF will be improved as more problems or surprises are revealed in Microsoft's implementation and traced to their causes.

But Microsoft should fix break-before="auto".

(I welcome corrections and other technical interpretations of what has gone on here w.r.t. interpretation of ODF 1.1, especially ones that are even vaguely plausible. Is there something I have overlooked?)



9 Comments

Surely it's only a get-out if the bug *only* appears when both before and after are set. Otherwise you could edit the document slightly (remove fo:break-after="auto" on the h2 style, which appears superfluous anyway) and still have a problem because 'auto' and 'inherit' (should) mean two different things.

The ODF spec punting on what happens when a break is specified before and after seems a bit crazy, but a more minor lack of clarity is that (I assume) 'auto' doesn't count as a "use" of the property for allowing an implementation to do its own thing. If it's set on both then do whatever you would do if neither had the style applied, if it's set on only one then do whatever the other one specifies.

Dave: I think you are quite right.

Though I would read "Use" to mean that the attribute has been specified rather than it has particular value.

But the more interpretations, the more the text needs to be clarified. People who disparage review and maintenance of standards (believe it or not, they exist) need their heads read.

Rick, this is a great post. Thanks for doing the analysis that it took to provide it.

I heartily agree that armchair fixing is not called for but a recognition that the ODF specification needs to be read more critically and the ambiguities that creep into implementations be recognized as confirmation where a rigorous treatment is required.

As for the implementation notes, I too find that extremely promising and valuable. In the next round, I expect to see evolution of precision here too. As you saw, it is insufficient to see that a provision is supported or is not supported. What we want to know is in what way is a feature supported or not supported, including the edge cases and what happens when the unsupported appears or the supported provision deviates from the expected/provided-for.

orcmid: Thanks.

In part, I wanted to make a review that showed that it is possible to note problems without it being a pretext for accusations.

And in my mind was the idea that a review, without a model of the class of the error (e.g. above, is it the bottom-line Draft quality, etc.?), without indication of how difficult it is to work around the problem (e.g. above, what menu steps), and without an indication of the extent to which this may be a systematic problem (e.g., above, the standard), does not really go far enough to be very informative.

While in the longer term, people will come by Google for the particular technical issue, in the short term, readers will be more interested in evaluating SP2 and other ODF applications. Simplistic pass/fail judgments look like disguised mudslinging: of course tit requires tat in the undignified marketplace of ideas to a certain extent, but not always.

For example, I think the recent spreadsheet discussions would have been helped by making up classes of fidelity for spreadsheets. For example that Draft was the transfer of data values but not calculation of formulae, that Rich was data and formulae with recalculation, that Industrial was the presentation of graphics etc.

From that POV, SP2 provides at least Draft quality for spreadsheets, but the absence of Open Formula means it does not reach the Rich level except with its own ecosystem (which provides the Industrial level of fidelity, by definition, assuming that ODF does have bits missing that prevents Facsimile which is OOXML's goal.)

It would at least move us away from "that is no good"..."no it is good" biffs towards "that is some good" ... "but not good enough for this" IYSWIM.

Rick,

I agree with Orcmid, this is a great post. I've talked to some people on the Word dev team, and they agree that the issue with break-before="auto" not overriding the parent style’s page break attribute is a bug. They have a fix in testing already, and we plan to release it as part of an upcoming update to Office 2007.

As for your line numbering question, it looks like the ODF 1.1 spec uses a line numbering style that Word 2007 SP2 does not support. But I have to spend a little more time digging into that to figure out the details.

I noticed several 'automatic' values in the ODF spec without the value effect being explicitly defined.

I hope that is improved in the 1.2 spec

hAl: Please report them to the ODF TC comments list if you can. If that is too much trouble, put them in a comment here and I will forward them.

I don't think it is fair to criticize ODF as merely a dump format for OpenOffice. I think it is just needing to make the transition from an "exchange" format (where a lot of details don't matter) to being an "industrial" format, in the terms of my "classes of fidelity" blog.

Working on KOffice I've just encountered a similar problem if not the very same.

I loaded a test document and got way too many page breaks that shouldn't be there. However, in my case it wasn't because I misinterpreted auto.

But rather because those styles didn't have the break-before value set to anything, and I assumed I should inherit.

However, break-before (according to the 1.2 spec) doesn't support the value inherit, and the default value is auto.

When I fixed KOffice to not inherit the break-before value from the parent style the test document loaded fine.

Casper: Well done!

I am very hopeful that ODF 1.2 will be tighter than ODF 1.0 in several of these issues. And I expect it will not be too much work to adjust implementations to all agree. All these things just take time.

(I think many of the larger, mature open source projects may be having a problem, in that there are so few new C++ programmers. I went straight from C/LISP/Omnimark to Java with little C++ in between, so when I have looked at participating in KOffice or AbiWord, it looks too much out of my comfort zone. This is supposed to be an encouragement to keep on improving KOffice!)