Commit 7301f7d

Html Reader Process Titles as Headings Not Paragraphs
Fix PHPOffice#1692. Builds on work started some time ago by @0b10011, to whom primary credit is due.

Html Reader does not process the `head` section of the document and, in particular, does not process its `style` section. It will, however, process inline styles, so 0b10011's model of adding the title as a text run (with styles) will work well once this change is applied. However, that model would not deal with the alternative method of assigning a Title Style and just adding the title as text. To accommodate that, I have removed the declaration of heading font styles in the head section, and now generate them all inline in the body. This has the added benefit of making it possible to read the doc as html and then save it as docx, preserving, at least in part, any user-defined font styles. Note that html does have pre-defined title styles, but docx does not.

@constip suggests in the original issue that margin top and bottom are being applied too frequently. I believe that was addressed by recently merged PR PHPOffice#2475.

It is also suggested that the `*` css selector be dropped in favor of `body`. PR 2475 added the `body` selector. I agree that this renders the `*` selector unnecessary and, as stated in the issue, it can cause problems. This PR drops that selector.

It is also suggested that `loadHTML` be used instead of `loadXML`. This is not as easy a change as it seems, because `loadHTML` uses the ISO-8859-1 charset rather than UTF-8, so I will not attempt that change.
1 parent 11a7aaa commit 7301f7d

8 files changed, +155 -52 lines changed


docs/changes/1.x/1.3.0.md

+15
@@ -0,0 +1,15 @@
+# [1.3.0](https://github.com/PHPOffice/PHPWord/tree/1.3.0) (WIP)
+
+[Full Changelog](https://github.com/PHPOffice/PHPWord/compare/1.2.0...1.3.0)
+
+## Enhancements
+
+### Bug fixes
+
+- MsDoc Reader : Correct Font Size Calculation by [@oleibman](https://github.com/oleibman) Issue [#2526](https://github.com/PHPOffice/PHPWord/issues/2526) PR [#2531](https://github.com/PHPOffice/PHPWord/pull/2531)
+- Html Reader : Process Titles as Headings not Paragraphs [@0b10011](https://github.com/0b10011) and [@oleibman](https://github.com/oleibman) Issue [#1692](https://github.com/PHPOffice/PHPWord/issues/1692) PR [#2533](https://github.com/PHPOffice/PHPWord/pull/2533)
+
+### Miscellaneous
+
+
+### BC Breaks

src/PhpWord/Shared/Html.php

+18 -16
@@ -25,6 +25,7 @@
 use PhpOffice\PhpWord\Element\AbstractContainer;
 use PhpOffice\PhpWord\Element\Row;
 use PhpOffice\PhpWord\Element\Table;
+use PhpOffice\PhpWord\Element\TextRun;
 use PhpOffice\PhpWord\Settings;
 use PhpOffice\PhpWord\SimpleType\Jc;
 use PhpOffice\PhpWord\SimpleType\NumberFormat;
@@ -206,12 +207,12 @@ protected static function parseNode($node, $element, $styles = [], $data = []):
         $nodes = [
             // $method $node $element $styles $data $argument1 $argument2
             'p' => ['Paragraph', $node, $element, $styles, null, null, null],
-            'h1' => ['Heading', null, $element, $styles, null, 'Heading1', null],
-            'h2' => ['Heading', null, $element, $styles, null, 'Heading2', null],
-            'h3' => ['Heading', null, $element, $styles, null, 'Heading3', null],
-            'h4' => ['Heading', null, $element, $styles, null, 'Heading4', null],
-            'h5' => ['Heading', null, $element, $styles, null, 'Heading5', null],
-            'h6' => ['Heading', null, $element, $styles, null, 'Heading6', null],
+            'h1' => ['Heading', $node, $element, $styles, null, 'Heading1', null],
+            'h2' => ['Heading', $node, $element, $styles, null, 'Heading2', null],
+            'h3' => ['Heading', $node, $element, $styles, null, 'Heading3', null],
+            'h4' => ['Heading', $node, $element, $styles, null, 'Heading4', null],
+            'h5' => ['Heading', $node, $element, $styles, null, 'Heading5', null],
+            'h6' => ['Heading', $node, $element, $styles, null, 'Heading6', null],
             '#text' => ['Text', $node, $element, $styles, null, null, null],
             'strong' => ['Property', null, null, $styles, null, 'bold', true],
             'b' => ['Property', null, null, $styles, null, 'bold', true],
@@ -337,21 +338,22 @@ protected static function parseInput($node, $element, &$styles): void
     /**
      * Parse heading node.
      *
-     * @param \PhpOffice\PhpWord\Element\AbstractContainer $element
-     * @param array &$styles
-     * @param string $argument1 Name of heading style
-     *
-     * @return \PhpOffice\PhpWord\Element\TextRun
-     *
      * @todo Think of a clever way of defining header styles, now it is only based on the assumption, that
      * Heading1 - Heading6 are already defined somewhere
      */
-    protected static function parseHeading($element, &$styles, $argument1)
+    protected static function parseHeading(DOMNode $node, AbstractContainer $element, array &$styles, string $headingStyle): TextRun
     {
-        $styles['paragraph'] = $argument1;
-        $newElement = $element->addTextRun($styles['paragraph']);
+        self::parseInlineStyle($node, $styles['font']);
+        // Create a TextRun to hold styles and text
+        $styles['paragraph'] = $headingStyle;
+        $textRun = new TextRun($styles['paragraph']);

-        return $newElement;
+        // Create a title with level corresponding to number in heading style
+        // (Eg, Heading1 = 1)
+        $element->addTitle($textRun, (int) ltrim($headingStyle, 'Heading'));
+
+        // Return TextRun so children are parsed
+        return $textRun;
     }

     /**
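
The new `parseHeading` derives the title level with `ltrim($headingStyle, 'Heading')`. The second argument of `ltrim` is a set of characters to strip, not a literal prefix; it works here because no digit belongs to that set. A minimal sketch of the same extraction with an explicit pattern follows. `headingLevel` is a hypothetical helper for illustration and is not part of PHPWord or of this commit.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper (not part of PHPWord): map a style name such as
 * "Heading3" to its numeric level, or null when the name does not match.
 */
function headingLevel(string $headingStyle): ?int
{
    // Anchored pattern: only "Heading" followed by a single digit 1-6 matches.
    if (preg_match('/^Heading([1-6])$/', $headingStyle, $matches) === 1) {
        return (int) $matches[1];
    }

    return null;
}

// The character-mask form from the diff strips any leading H, e, a, d, i, n, g.
var_dump((int) ltrim('Heading2', 'Heading')); // int(2)
var_dump(headingLevel('Heading2'));           // int(2)
var_dump(headingLevel('Subtitle'));           // NULL
```

Both forms agree for `Heading1` through `Heading6`, which are the only values the `$nodes` table passes in; the pattern form simply makes unexpected input visible instead of casting it to `0`.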

src/PhpWord/Writer/HTML/Element/Title.php

+7 -1
@@ -46,8 +46,14 @@ public function write()
             $writer = new Container($this->parentWriter, $text);
             $text = $writer->write();
         }
+        $css = '';
+        $style = \PhpOffice\PhpWord\Style::getStyle('Heading_' . $this->element->getDepth());
+        if ($style !== null) {
+            $styleWriter = new \PhpOffice\PhpWord\Writer\HTML\Style\Font($style);
+            $css = ' style="' . $styleWriter->write() . '"';
+        }

-        $content = "<{$tag}>{$text}</{$tag}>" . PHP_EOL;
+        $content = "<{$tag}{$css}>{$text}</{$tag}>" . PHP_EOL;

         return $content;
     }
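
The Title writer change above moves heading font styles from the `head` stylesheet to an inline `style` attribute on each heading tag. A library-free sketch of that string assembly follows. `renderHeading` and its parameters are hypothetical names for illustration only; the real writer obtains the CSS text from its font style writer rather than from an array.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical illustration (not PHPWord code): build a heading tag and
 * attach CSS declarations inline only when some are defined.
 *
 * @param array<string, string> $declarations CSS property => value
 */
function renderHeading(int $depth, string $text, array $declarations = []): string
{
    $tag = 'h' . $depth;
    $css = '';
    if ($declarations !== []) {
        $parts = [];
        foreach ($declarations as $property => $value) {
            $parts[] = "{$property}: {$value};";
        }
        // The leading space separates the attribute from the tag name.
        $css = ' style="' . implode(' ', $parts) . '"';
    }

    return "<{$tag}{$css}>" . htmlspecialchars($text) . "</{$tag}>";
}

echo renderHeading(1, 'Title 1', ['font-size' => '20pt']), PHP_EOL;
echo renderHeading(2, 'Title 2'), PHP_EOL;
```

When no style is defined for a level, the tag is emitted bare, which mirrors the `$css = ''` default in the diff.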

src/PhpWord/Writer/HTML/Part/Head.php

+3 -4
@@ -90,17 +90,16 @@ private function writeStyles(): string
             'font-family' => $this->getFontFamily(Settings::getDefaultFontName(), $this->getParentWriter()->getDefaultGenericFont()),
             'font-size' => Settings::getDefaultFontSize() . 'pt',
         ];
-        // Mpdf sometimes needs separate tag for body; doesn't harm others.
-        $bodyarray = $astarray;

         $defaultWhiteSpace = $this->getParentWriter()->getDefaultWhiteSpace();
         if ($defaultWhiteSpace) {
             $astarray['white-space'] = $defaultWhiteSpace;
         }
+        $bodyarray = $astarray;

         foreach ([
             'body' => $bodyarray,
-            '*' => $astarray,
+            //'*' => $astarray,
             'a.NoteRef' => [
                 'text-decoration' => 'none',
             ],
@@ -137,8 +136,8 @@ private function writeStyles(): string
                     $style = $styleParagraph;
                 } else {
                     $name = '.' . $name;
+                    $css .= "{$name} {" . $styleWriter->write() . '}' . PHP_EOL;
                 }
-                $css .= "{$name} {" . $styleWriter->write() . '}' . PHP_EOL;
             }
             if ($style instanceof Paragraph) {
                 $styleWriter = new ParagraphStyleWriter($style);
@@ -0,0 +1,66 @@
+<?php
+/**
+ * This file is part of PHPWord - A pure PHP library for reading and writing
+ * word processing documents.
+ *
+ * PHPWord is free software distributed under the terms of the GNU Lesser
+ * General Public License version 3 as published by the Free Software Foundation.
+ *
+ * For the full copyright and license information, please read the LICENSE
+ * file that was distributed with this source code. For the full list of
+ * contributors, visit https://github.com/PHPOffice/PHPWord/contributors.
+ *
+ * @see https://github.com/PHPOffice/PHPWord
+ *
+ * @license http://www.gnu.org/licenses/lgpl.txt LGPL version 3
+ */
+
+namespace PhpOffice\PhpWordTests\Shared;
+
+use PhpOffice\PhpWord\Element\TextRun;
+use PhpOffice\PhpWord\PhpWord;
+use PhpOffice\PhpWord\Settings;
+use PhpOffice\PhpWord\Shared\Html as SharedHtml;
+use PhpOffice\PhpWord\Writer\HTML as HtmlWriter;
+use PHPUnit\Framework\TestCase;
+
+/**
+ * Test class for PhpOffice\PhpWord\Shared\Html.
+ *
+ * @coversDefaultClass \PhpOffice\PhpWord\Shared\Html
+ */
+class HtmlHeadingsTest extends TestCase
+{
+    public function testRoundTripHeadings(): void
+    {
+        Settings::setOutputEscapingEnabled(true);
+        $originalDoc = new PhpWord();
+        $originalDoc->addTitleStyle(1, ['size' => 20]);
+        $section = $originalDoc->addSection();
+        $expectedStrings = [];
+        $section->addTitle('Title 1', 1);
+        $expectedStrings[] = '<h1 style="font-size: 20pt;">Title 1</h1>';
+        for ($i = 2; $i <= 6; ++$i) {
+            $textRun = new TextRun();
+            $textRun->addText('Title ');
+            $textRun->addText("$i", ['italic' => true]);
+            $section->addTitle($textRun, $i);
+            $expectedStrings[] = "<h$i>Title <span style=\"font-style: italic;\">$i</span></h$i>";
+        }
+        $writer = new HtmlWriter($originalDoc);
+        $content = $writer->getContent();
+        foreach ($expectedStrings as $expectedString) {
+            self::assertStringContainsString($expectedString, $content);
+        }
+
+        $newDoc = new PhpWord();
+        $newSection = $newDoc->addSection();
+        SharedHtml::addHtml($newSection, $content, true);
+        $newWriter = new HtmlWriter($newDoc);
+        $newContent = $newWriter->getContent();
+        // Reader transforms Text to TextRun,
+        // but result is functionally the same.
+        $firstStringAsTextRun = '<h1><span style="font-size: 20pt;">Title 1</span></h1>';
+        self::assertSame($content, str_replace($firstStringAsTextRun, $expectedStrings[0], $newContent));
+    }
+}

tests/PhpWordTests/Writer/HTML/FontTest.php

+28 -28
@@ -84,23 +84,23 @@ public function testFontNames1(): void
         self::assertEquals('style5', Helper::getTextContent($xpath, '/html/body/div/p[6]/span', 'class'));

         $style = Helper::getTextContent($xpath, '/html/head/style');
-        $prg = preg_match('/^[*][^\\r\\n]*/m', $style, $matches);
-        self::assertNotFalse($prg);
-        self::assertEquals('* {font-family: \'Courier New\'; font-size: 12pt;}', $matches[0]);
+        $prg = preg_match('/^body[^\\r\\n]*/m', $style, $matches);
+        self::assertSame(1, $prg);
+        self::assertEquals('body {font-family: \'Courier New\'; font-size: 12pt;}', $matches[0]);
         $prg = preg_match('/^[.]style1[^\\r\\n]*/m', $style, $matches);
-        self::assertNotFalse($prg);
+        self::assertSame(1, $prg);
         self::assertEquals('.style1 {font-family: \'Tahoma\'; font-size: 10pt; color: #1B2232; font-weight: bold;}', $matches[0]);
         $prg = preg_match('/^[.]style2[^\\r\\n]*/m', $style, $matches);
-        self::assertNotFalse($prg);
+        self::assertSame(1, $prg);
         self::assertEquals('.style2 {font-family: \'Arial\'; font-size: 10pt;}', $matches[0]);
         $prg = preg_match('/^[.]style3[^\\r\\n]*/m', $style, $matches);
-        self::assertNotFalse($prg);
+        self::assertSame(1, $prg);
         self::assertEquals('.style3 {font-family: \'hack attempt&#039;}; display:none\'; font-size: 10pt;}', $matches[0]);
         $prg = preg_match('/^[.]style4[^\\r\\n]*/m', $style, $matches);
-        self::assertNotFalse($prg);
+        self::assertSame(1, $prg);
         self::assertEquals('.style4 {font-family: \'padmaa 1.1\'; font-size: 10pt; font-weight: bold;}', $matches[0]);
         $prg = preg_match('/^[.]style5[^\\r\\n]*/m', $style, $matches);
-        self::assertNotFalse($prg);
+        self::assertSame(1, $prg);
         self::assertEquals('.style5 {font-family: \'MingLiU-ExtB\'; font-size: 10pt; font-weight: bold;}', $matches[0]);
     }

@@ -134,20 +134,20 @@ public function testFontNames2(): void
         self::assertEquals('style4', Helper::getTextContent($xpath, '/html/body/div/p[5]/span', 'class'));

         $style = Helper::getTextContent($xpath, '/html/head/style');
-        $prg = preg_match('/^[*][^\\r\\n]*/m', $style, $matches);
-        self::assertNotFalse($prg);
-        self::assertEquals('* {font-family: \'Courier New\'; font-size: 12pt;}', $matches[0]);
+        $prg = preg_match('/^body[^\\r\\n]*/m', $style, $matches);
+        self::assertSame(1, $prg);
+        self::assertEquals('body {font-family: \'Courier New\'; font-size: 12pt;}', $matches[0]);
         $prg = preg_match('/^[.]style1[^\\r\\n]*/m', $style, $matches);
-        self::assertNotFalse($prg);
+        self::assertSame(1, $prg);
         self::assertEquals('.style1 {font-family: \'Tahoma\'; font-size: 10pt; color: #1B2232; font-weight: bold;}', $matches[0]);
         $prg = preg_match('/^[.]style2[^\\r\\n]*/m', $style, $matches);
-        self::assertNotFalse($prg);
+        self::assertSame(1, $prg);
         self::assertEquals('.style2 {font-family: \'Arial\', sans-serif; font-size: 10pt;}', $matches[0]);
         $prg = preg_match('/^[.]style3[^\\r\\n]*/m', $style, $matches);
-        self::assertNotFalse($prg);
+        self::assertSame(1, $prg);
         self::assertEquals('.style3 {font-family: \'DejaVu Sans Monospace\', monospace; font-size: 10pt;}', $matches[0]);
         $prg = preg_match('/^[.]style4[^\\r\\n]*/m', $style, $matches);
-        self::assertNotFalse($prg);
+        self::assertSame(1, $prg);
         self::assertEquals('.style4 {font-family: \'Arial\'; font-size: 10pt;}', $matches[0]);
     }

@@ -181,20 +181,20 @@ public function testFontNames3(): void
         self::assertEquals('style4', Helper::getTextContent($xpath, '/html/body/div/p[5]/span', 'class'));

         $style = Helper::getTextContent($xpath, '/html/head/style');
-        $prg = preg_match('/^[*][^\\r\\n]*/m', $style, $matches);
-        self::assertNotFalse($prg);
-        self::assertEquals('* {font-family: \'Courier New\', monospace; font-size: 12pt;}', $matches[0]);
+        $prg = preg_match('/^body[^\\r\\n]*/m', $style, $matches);
+        self::assertSame(1, $prg);
+        self::assertEquals('body {font-family: \'Courier New\', monospace; font-size: 12pt;}', $matches[0]);
         $prg = preg_match('/^[.]style1[^\\r\\n]*/m', $style, $matches);
-        self::assertNotFalse($prg);
+        self::assertSame(1, $prg);
         self::assertEquals('.style1 {font-family: \'Tahoma\'; font-size: 10pt; color: #1B2232; font-weight: bold;}', $matches[0]);
         $prg = preg_match('/^[.]style2[^\\r\\n]*/m', $style, $matches);
-        self::assertNotFalse($prg);
+        self::assertSame(1, $prg);
         self::assertEquals('.style2 {font-family: \'Arial\', sans-serif; font-size: 10pt;}', $matches[0]);
         $prg = preg_match('/^[.]style3[^\\r\\n]*/m', $style, $matches);
-        self::assertNotFalse($prg);
+        self::assertSame(1, $prg);
         self::assertEquals('.style3 {font-family: \'DejaVu Sans Monospace\', monospace; font-size: 10pt;}', $matches[0]);
         $prg = preg_match('/^[.]style4[^\\r\\n]*/m', $style, $matches);
-        self::assertNotFalse($prg);
+        self::assertSame(1, $prg);
         self::assertEquals('.style4 {font-family: \'Arial\'; font-size: 10pt;}', $matches[0]);
     }

@@ -221,19 +221,19 @@ public function testWhiteSpace(): void
         $xpath = new DOMXPath($dom);

         $style = Helper::getTextContent($xpath, '/html/head/style');
-        self::assertNotFalse(preg_match('/^[*][^\\r\\n]*/m', $style, $matches));
-        self::assertEquals('* {font-family: \'Arial\'; font-size: 12pt; white-space: pre-wrap;}', $matches[0]);
+        self::assertNotFalse(preg_match('/^body[^\\r\\n]*/m', $style, $matches));
+        self::assertEquals('body {font-family: \'Arial\'; font-size: 12pt; white-space: pre-wrap;}', $matches[0]);
         $prg = preg_match('/^[.]style1[^\\r\\n]*/m', $style, $matches);
-        self::assertNotFalse($prg);
+        self::assertSame(1, $prg);
         self::assertEquals('.style1 {font-family: \'Courier New\'; font-size: 10pt; white-space: pre-wrap;}', $matches[0]);
         $prg = preg_match('/^[.]style2[^\\r\\n]*/m', $style, $matches);
-        self::assertNotFalse($prg);
+        self::assertSame(1, $prg);
         self::assertEquals('.style2 {font-family: \'Courier New\'; font-size: 10pt;}', $matches[0]);
         $prg = preg_match('/^[.]style3[^\\r\\n]*/m', $style, $matches);
-        self::assertNotFalse($prg);
+        self::assertSame(1, $prg);
         self::assertEquals('.style3 {font-family: \'Courier New\'; font-size: 10pt; white-space: normal;}', $matches[0]);
         $prg = preg_match('/^[.]style4[^\\r\\n]*/m', $style, $matches);
-        self::assertNotFalse($prg);
+        self::assertSame(1, $prg);
         self::assertEquals('.style4 {font-family: \'Courier New\'; font-size: 10pt;}', $matches[0]);
     }
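
The switch from `assertNotFalse($prg)` to `assertSame(1, $prg)` in this file tightens the tests: `preg_match` returns `0` when the pattern does not match and `false` only on error, so the weaker assertion passes even when nothing matched. A small standalone sketch of the difference, using a made-up stylesheet string rather than real writer output:

```php
<?php

declare(strict_types=1);

// Made-up stylesheet text for illustration; it has a body rule but no "*" rule.
$style = "body {font-family: 'Courier New'; font-size: 12pt;}" . PHP_EOL
    . ".style1 {font-family: 'Tahoma'; font-size: 10pt;}" . PHP_EOL;

// No line starts with "*", so preg_match reports 0 matches (not false).
$noMatch = preg_match('/^[*][^\r\n]*/m', $style, $matches);
var_dump($noMatch);           // int(0)
var_dump($noMatch !== false); // bool(true): a not-false check would still pass

// A line does start with "body", so the result is exactly 1.
$match = preg_match('/^body[^\r\n]*/m', $style, $matches);
var_dump($match);             // int(1)
echo $matches[0], PHP_EOL;
```

With only the not-false check, a test would go on to read `$matches[0]` from an empty array; requiring exactly `1` fails at the point where the match is actually missing.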

tests/PhpWordTests/Writer/HTML/Helper.php

+10 -1
@@ -64,7 +64,7 @@ public static function getNamedItem(DOMXPath $xpath, string $query, string $name
             if ($item2 === null) {
                 self::fail('Unexpected null return requesting item');
             } else {
-                $returnValue = $item2->attributes->getNamedItem($namedItem);
+                $returnVal = $item2->attributes->getNamedItem($namedItem);
             }
         }

@@ -94,4 +94,13 @@ public static function getAsHTML(PhpWord $phpWord, string $defaultWhiteSpace = '

         return $dom;
     }
+
+    public static function getHtmlString(PhpWord $phpWord, string $defaultWhiteSpace = '', string $defaultGenericFont = ''): string
+    {
+        $htmlWriter = new HTML($phpWord);
+        $htmlWriter->setDefaultWhiteSpace($defaultWhiteSpace);
+        $htmlWriter->setDefaultGenericFont($defaultGenericFont);
+
+        return $htmlWriter->getContent();
+    }
 }

tests/PhpWordTests/Writer/HTML/PartTest.php

+8 -2
@@ -178,11 +178,17 @@ public function testTitleStyles(): void
         $xpath = new DOMXPath($dom);

         $style = Helper::getTextContent($xpath, '/html/head/style');
-        self::assertNotFalse(strpos($style, 'h1 {font-family: \'Calibri\'; font-weight: bold;}'));
+        //self::assertNotFalse(strpos($style, 'h1 {font-family: \'Calibri\'; font-weight: bold;}'));
         self::assertNotFalse(strpos($style, 'h1 {margin-top: 0.5pt; margin-bottom: 0.5pt;}'));
-        self::assertNotFalse(strpos($style, 'h2 {font-family: \'Times New Roman\'; font-style: italic;}'));
+        //self::assertNotFalse(strpos($style, 'h2 {font-family: \'Times New Roman\'; font-style: italic;}'));
         self::assertNotFalse(strpos($style, 'h2 {margin-top: 0.25pt; margin-bottom: 0.25pt;}'));
         self::assertEquals(1, Helper::getLength($xpath, '/html/body/div/h1'));
         self::assertEquals(2, Helper::getLength($xpath, '/html/body/div/h2'));
+        // code for getNamedItem had been erroneous
+        self::assertSame("font-family: 'Calibri'; font-weight: bold;", Helper::getNamedItem($xpath, '/html/body/div/h1', 'style')->textContent);
+        $html = Helper::getHtmlString($phpWord);
+        self::assertStringContainsString('<h1 style="font-family: \'Calibri\'; font-weight: bold;">Header 1 #1</h1>', $html);
+        self::assertStringContainsString('<h2 style="font-family: \'Times New Roman\'; font-style: italic;">Header 2 #1</h2>', $html);
+        self::assertStringContainsString('<h2 style="font-family: \'Times New Roman\'; font-style: italic;">Header 2 #2</h2>', $html);
     }
 }
