Word to HTML to CSV Delimitation Primer Tutorial

Modern document applications allow conversion to HTML. What happens during that process, exactly? Well, that’s “under the hood” stuff. A little background, though, and context …

  • Why would you want to convert, say, a Word file to HTML (using, perhaps, LibreOffice, as in our case, or Microsoft Word)? … well, as a mere mortal programmer …
  • (any form of) text is easier to deal with for the “mere mortal programmer” languages we might want to use, like …
  • PHP … which is very good at the delimiter processing bits that allow the programmer to be useful …
  • converting … the data into other guises, and the one that interested us was …
  • CSV (comma-separated values) data … to be fed into spreadsheet applications like Excel or LibreOffice Calc … to then create charts

… and to do useful delimiter work in PHP you need to know, or suss out, “what happens”, or evidence of that … think hex dumps (where $dr is a PHP variable containing an HTML file record) …

<?php

echo bin2hex($dr) . "\n";
// ... gave, in our case, output such as ...
// c2a020c2a020c2a020c2a020c2a020c2a020c2a020c2a020c2a020c2a020c2a020c2a020c2a020c2a020c2a020c2a020c2a020546f74616c207c20c2a020c2a02036302c30333220c2a020c2a020c2a03130302e3030

?>
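If scanning a long hex dump by eye is not appealing, the following is a minimal sketch of letting PHP point out the bytes worth asking about. It assumes $dr holds raw bytes (here rebuilt with hex2bin() from an excerpt of the dump above), and the variable names $matches and $suspects are ours, purely for illustration.

```php
<?php
// Excerpt of the hex dump above, turned back into raw bytes.
$dr = hex2bin('c2a020546f74616c207c20c2a020c2a02036302c303332');

// Collect every run of bytes outside 7-bit ASCII (byte-wise match, no /u flag).
preg_match_all('/[\x80-\xFF]+/', $dr, $matches);

// Report each distinct run once, in hex.
$suspects = array_values(array_unique(array_map('bin2hex', $matches[0])));
print_r($suspects);
```

For this excerpt the only distinct run reported is c2a0, which is the one we went asking about.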

And so we line up all the useful contributors …

  1. C3PO
  2. C2A0
  3. R2D2
  4. … … …

Hang on?! What’s with C2A0? And for that matter, the pitiful “am typing” simulation “… … …”?!

Well, we asked around, and got to this useful link telling us these are non-ASCII bytes (C2 A0 being the UTF-8 encoding of U+00A0) describing a …


Non-breaking space

… scenario, which programmers of HTML will know as those …


&nbsp;

… HTML entities in your webpage content. Well, now, at least to us, that all makes sense. But, for our job, that could be the tip of the “UTF-8 headache” iceberg! We know we’re only interested in ASCII data characters for the conversion job we are trying to do. Is there a way to simplify this “middleperson” HTML data content? Well, this other useful link … got us to use …

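<?php

// Byte-wise match (no /u flag): removes every byte from 0x7F up, so only
// 7-bit ASCII below DEL survives. With /u the range would mean code points
// U+007F to U+00FF only, and invalid UTF-8 input would make preg_replace return null.
$dr = preg_replace('/[\x7F-\xFF]/', '', $dr);

?>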

… which helped us with …

  1. sanity
  2. simplification

… as far as the PHP delimitation logic went. This was an in-house job, but we’ll show you a skeleton of how we used …

  • input Word report … we are calling from_word_to_html.html … say …
  • containing spreadsheetable data …
  • we wanted to extract into …
  • individual CSV files … ready to …
  • open as useful spreadsheets … and perhaps onto some chart production …
  • processing via the command line …

    php dostuff.php

    … where that PHP is (very informally) …
  • dostuff.php

… in case these ideas interest you?!
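For a feel of how the steps listed above can hang together, here is a minimal sketch. It is not the dostuff.php of this post: the function names to_ascii_text, split_fields and csv_line are hypothetical, and the field layout (a “|” or a run of two or more spaces between fields) is only our guess from the one record in the hex dump further up. Your own report will likely need different delimiter rules.

```php
<?php
// Hypothetical helper names; this is a sketch, not the post's dostuff.php.

/** Reduce one HTML record to plain ASCII text. */
function to_ascii_text(string $record): string
{
    // Drop tags, then turn entities such as &nbsp; into real characters.
    $text = html_entity_decode(strip_tags($record), ENT_QUOTES | ENT_HTML5, 'UTF-8');
    // Byte-wise removal of DEL and everything above 7-bit ASCII (e.g. C2 A0).
    return trim(preg_replace('/[\x7F-\xFF]/', '', $text));
}

/** Split a cleaned record on "|" or on runs of 2+ spaces (an assumed layout). */
function split_fields(string $text): array
{
    return preg_split('/\s*\|\s*|\s{2,}/', $text, -1, PREG_SPLIT_NO_EMPTY);
}

/** Format fields as one CSV line, quoting where needed (e.g. "60,032"). */
function csv_line(array $fields): string
{
    $fh = fopen('php://memory', 'w+');
    fputcsv($fh, $fields, ',', '"', '\\');
    rewind($fh);
    $line = stream_get_contents($fh);
    fclose($fh);
    return rtrim($line, "\n");
}

// Tail of the record from the hex dump, rebuilt with hex2bin().
$dr = hex2bin('c2a020546f74616c207c20c2a020c2a02036302c30333220c2a020c2a020c2a03130302e3030');
echo csv_line(split_fields(to_ascii_text($dr))), "\n";
```

Leaving the quoting to fputcsv() matters here because a figure like 60,032 contains the very comma that CSV uses as its delimiter. To write real files, the same fputcsv() call can point at a handle from fopen() on a .csv filename instead of php://memory.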
