Modern document applications allow conversion to HTML. What happens during that process, exactly? Well, that’s “under the hood” stuff. A little background, though, and context …
- Why would you want to convert, say, a Word file to HTML (using, perhaps, LibreOffice, in our case, or Microsoft Word)? … well, as a mere mortal programmer …
- (any form of) text is easier to deal with for the “mere mortal programmer” languages we might want to use, like …
- PHP … which is very good at the delimiter processing bits that allow the programmer to be useful …
- converting … the data into other guises, the one that interested us being …
- CSV (comma separated value) data … to be fed into spreadsheet applications like Excel or LibreOffice Calc … and then create charts
… and to do useful delimiter work in PHP you need to know, or suss out, “what happens”, or evidence of that … think hex dumps (where $dr is a PHP variable containing an HTML file record) …
<?php
echo bin2hex($dr) . "\n";
// ... gave, in our case, output such as ...
// c2a020c2a020c2a020c2a020c2a020c2a020c2a020c2a020c2a020c2a020c2a020c2a020c2a020c2a020c2a020c2a020c2a020546f74616c207c20c2a020c2a02036302c30333220c2a020c2a020c2a03130302e3030
?>
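A long run of hex digits is hard to scan by eye, so it can help to split it into byte pairs and flag anything outside 7-bit ASCII. The following is only a minimal sketch: `annotate_hex()` is a hypothetical helper name, not something from our actual script, and the sample is a shortened excerpt of the dump above.

```php
<?php
// Hypothetical helper (an illustration, not part of the original script):
// split a hex dump into byte pairs and bracket every byte above 0x7E.
function annotate_hex(string $data): string
{
    $out = [];
    foreach (str_split(bin2hex($data), 2) as $pair) {
        // hexdec() turns the two hex digits back into a byte value 0..255
        $out[] = hexdec($pair) > 0x7E ? '[' . strtoupper($pair) . ']' : $pair;
    }
    return implode(' ', $out);
}

// Shortened excerpt of the dump shown above, converted back to raw bytes.
$dr = hex2bin('c2a020546f74616c207c20c2a020c2a02036302c303332');
echo annotate_hex($dr), "\n";
```

Bracketed pairs are the bytes that plain ASCII delimiter logic will trip over.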
And so we line up all the useful contributors …
- C3PO
- C2A0
- R2D2
- … … …
Hang on?! What’s with C2A0? And for that matter, the pitiful “am typing” simulation “… … …”?!
Well, we asked around, and got to this useful link telling us these are non-ASCII characters (the bytes C2 A0 are the UTF-8 encoding of U+00A0) describing a …
Non-breaking space
… scenario programmers of HTML will know can be those …
&nbsp;
… HTML entities in your webpage content. Well, now, at least to us, that all makes sense. But, for our job, that could be the tip of the “UTF-8 headache” iceberg! We know we’re only interested in ASCII data characters for the conversion job we are trying to do. Is there a way to simplify this “middleperson” HTML data content? Well, this other useful link … got us to use …
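<?php
// Byte-wise match (no /u modifier): strips every byte from 0x7F to 0xFF, so
// each byte of a multi-byte UTF-8 sequence such as C2 A0 is removed.
$dr = preg_replace('/[\x7F-\xFF]/', '', $dr);
?>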
… helped us with …
- sanity
- simplification
… as far as the PHP delimitation logic went. This was an in-house job, but we’ll show you a skeleton of how we used …
- input Word report, saved as HTML … we are calling from_word_to_html.html … say …
- containing spreadsheetable data …
- we wanted to extract into …
- individual CSV files … ready to …
- open as useful spreadsheets … and perhaps onto some chart production …
- processing via command line command …
php dostuff.php
… where that PHP is (very informally) …
- dostuff.php
… in case these ideas interest you?!
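The steps listed above could be sketched roughly as follows. This is not the original dostuff.php: the function names, the "|" delimiter, the made-up sample record (modelled on the hex dump earlier) and writing to standard output rather than to individual CSV files are all assumptions for illustration.

```php
<?php
// Hypothetical sketch, not the original dostuff.php. Names, the "|" delimiter
// and the sample record are assumptions made for illustration only.

// Reduce one HTML record to plain ASCII text.
function to_ascii_text(string $record): string
{
    $text = strip_tags($record);                      // drop HTML markup
    $text = str_replace('&nbsp;', ' ', $text);        // entity form of the non-breaking space
    $text = preg_replace('/[\x7F-\xFF]/', '', $text); // byte-wise: removes C2 A0 and other non-ASCII bytes
    return trim(preg_replace('/\s+/', ' ', $text));   // collapse runs of whitespace
}

// Split a cleaned record into trimmed fields on the assumed delimiter.
function to_fields(string $text, string $delimiter = '|'): array
{
    return array_map('trim', explode($delimiter, $text));
}

// Made-up record; a real run would read lines of from_word_to_html.html.
$records = [
    "<p>\xC2\xA0 \xC2\xA0 Total | \xC2\xA0 60,032 | \xC2\xA0 100.00</p>",
];

$out = fopen('php://output', 'w'); // a real run might open a .csv file instead
foreach ($records as $dr) {
    // Separator, enclosure and escape are passed explicitly; an empty escape
    // string (PHP 7.4+) disables the proprietary escape mechanism.
    fputcsv($out, to_fields(to_ascii_text($dr)), ',', '"', '');
}
fclose($out);
```

Because fputcsv() quotes any field containing the separator, a value such as 60,032 stays in one spreadsheet cell.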