Modern document applications allow conversion to HTML. What happens during that process, exactly? Well, that’s “under the hood” stuff. A little background, though, and context …
- Why would you want to convert, say, a Word file to HTML (using, perhaps, LibreOffice, as in our case, or Microsoft Word)? … well, as a mere mortal programmer …
- (any form of) text is easier to deal with for “mere mortal programmer” languages we might want to use like …
- PHP … is very good at the delimiter processing bits that allow the programmer to be useful …
- converting … the data into other guises, the one that interested us was …
- CSV (comma-separated values) data … to be fed into spreadsheet applications like Excel or LibreOffice Calc … and then create charts
… and to do useful delimiter work in PHP you need to know, or suss out, “what happens”, or evidence of that … think hex dumps (where $dr is a PHP variable containing an HTML file record) …
<?php
echo bin2hex($dr) . "\n";
// ... gave, in our case, output such as ...
// c2a020c2a020c2a020c2a020c2a020c2a020c2a020c2a020c2a020c2a020c2a020c2a020c2a020c2a020c2a020c2a020c2a020546f74616c207c20c2a020c2a02036302c30333220c2a020c2a020c2a03130302e3030
?>
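If a raw run of hex like that is hard on the eyes, a side-by-side view can help. What follows is only a minimal sketch, not part of our job: `hex_rows` is a hypothetical helper name of ours, and the sample is a slice of the dump above turned back into bytes with `hex2bin`. It prints each byte as a hex pair, with a text column in which anything outside printable ASCII shows as a dot.

```php
<?php
// Hypothetical helper (our naming, for illustration only): format a string
// as rows of hex byte pairs followed by a printable-ASCII column.
function hex_rows(string $data, int $width = 16): array {
    $rows = [];
    foreach (str_split($data, $width) as $chunk) {
        // Space-separated hex pairs, one pair per byte.
        $hex = implode(' ', str_split(bin2hex($chunk), 2));
        // Show every byte outside printable ASCII (0x20-0x7E) as a dot.
        $text = preg_replace('/[^\x20-\x7E]/', '.', $chunk);
        // Pad the hex column so the text column lines up on short rows.
        $rows[] = str_pad($hex, $width * 3 - 1) . '  ' . $text;
    }
    return $rows;
}

// A slice of the dump above, converted from hex back into raw bytes.
$sample = hex2bin('c2a020546f74616c207c20c2a020c2a02036302c303332');
echo implode("\n", hex_rows($sample)) . "\n";
```

The dots in the text column mark exactly the bytes worth being suspicious about.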
And so we line up all the useful contributors …
- C3PO
- C2A0
- R2D2
- … … …
Hang on?! What’s with C2A0? And for that matter, the pitiful “am typing” simulation “… … …”?!
Well, we asked around, and got to this useful link telling us that the bytes C2 A0 are the UTF-8 encoding of a non-ASCII character (U+00A0), the …
Non-breaking space
… which programmers of HTML will know as those …
&nbsp;
… HTML entities in your webpage content. Well, now, at least to us, that all makes sense. But, for our job, that could be the tip of the “UTF-8 headache” iceberg! We know we’re only interested in ASCII data characters for the conversion job we are trying to do. Is there a way to simplify this “middleperson” HTML data content? Well, this other useful link … got us to use …
<?php
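// Byte-wise match (no /u modifier): strips DEL and every byte of any
// multi-byte UTF-8 sequence, so only 7-bit ASCII is left in $dr.
$dr = preg_replace('/[\x7F-\xFF]/', '', $dr);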
?>
… helped us with …
- sanity
- simplification
… as far as the PHP delimitation logic went. This was an in-house job, but we’ll show you a skeleton of how we used …
- input Word report … we are calling from_word_to_html.html … say …
- containing spreadsheetable data …
- we wanted to extract into …
- individual CSV files … ready to …
- open as useful spreadsheets … and perhaps onto some chart production …
- processing via command line command …
php dostuff.php
… where that PHP is (very informally) … - dostuff.php
… in case these ideas interest you?!
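For a feel of the steps above without opening dostuff.php, here is a minimal sketch. It is not the script we ran: the function names `clean_record`, `record_to_cells` and `write_csv`, the `|` cell delimiter, and the sample record are all assumptions made up for illustration. It strips the non-ASCII bytes, splits a record into cells, and lets PHP’s `fputcsv` take care of the CSV quoting.

```php
<?php
// A sketch only: these function names, the "|" delimiter and the sample
// record are assumptions for illustration, not the contents of dostuff.php.

// Drop every byte outside 7-bit ASCII, then collapse runs of whitespace.
function clean_record(string $dr): string {
    $dr = preg_replace('/[\x7F-\xFF]/', '', $dr);
    return trim(preg_replace('/\s+/', ' ', $dr));
}

// Split a cleaned record into trimmed cells on the assumed "|" delimiter.
function record_to_cells(string $dr): array {
    return array_map('trim', explode('|', clean_record($dr)));
}

// Write rows to an open stream as CSV; fputcsv() quotes cells that
// contain commas, such as "60,032".
function write_csv($handle, array $rows): void {
    foreach ($rows as $cells) {
        fputcsv($handle, $cells, ',', '"', '\\');
    }
}

// A made-up record with non-breaking spaces (bytes C2 A0) mixed in.
$record = "\xC2\xA0 \xC2\xA0 Total | \xC2\xA0 60,032 | \xC2\xA0 100.00";
$out = fopen('php://output', 'w');
write_csv($out, [record_to_cells($record)]);
fclose($out);
```

Swapping `php://output` for a real filename in `fopen` would give one CSV file per table, ready for the spreadsheet application.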