Saturday, July 15, 2023

ImmPrep v1.01x / Pandoc / UTF8 & PureBasic

Writing.

I've updated immprep.exe, fixing two small bugs and adding Word (.docx) support. Note that the command line parameters have changed to allow for more flexibility and to make room for other TTS applications (not yet implemented, but it's easier to do now). While working on Word support I ran into a few issues (for the full story and the latest version of ImmPrep, read on).


Where to find

If you want a copy of Immprep then drop me a line in the comments.


Background

Microsoft Edge has a (very good) built-in Text-to-Speech "Immersive Reader" (that seems to use Microsoft Azure neural voices), which is way better than the native Windows 10 / 11 TTS or (even worse) the SAPI 5 (?) voice included with Microsoft Word.

But to use it I had to manually save my document as HTML, load it into Edge, and then start the Immersive Reader inside Edge. That was very inconvenient.

PureBasic and TouchPortal to the rescue 😁

You'll find my first attempt here, which uses an older version of Immprep but otherwise applies the same concept.


Writing order

A little background on the working order when working on my novel (yeah, I'm trying to be a novelist, without any success so far) to explain why things go wrong:

1. I write in Google Docs (using the 'block' format)

When writing in Google Docs I typically use straight quotes, i.e. ' and ". I find curly quotes somewhat annoying when writing :-) I also use -- instead of a real em-dash at this stage.

2. I use Balabolka or Immprep + Edge to do TTS checks

3. ... and keep repeating the above until I'm done

4. When the story is ready I save the Google Docs document to .docx

5. Then I mess around a bit with the format to make it a manuscript

Editors and publishers typically want Word, curly quotes, real em-dashes, and everything in the 'manuscript' format.

6. I do my final edits in Word

7. And use Immprep + Edge to do TTS checks


Pandoc (.docx to .html)

You never have enough tools in your toolbox 😊 

Pandoc is a command line tool that converts documents from one format to another. It isn't perfect (it's worse than the built-in converters inside Google Docs and Word) but it runs from the command line, and it's free (no need to spend hundreds of dollars on the conversion tools some other sites try to sell you).

You'll find pandoc here:


Pandoc's command line sequence is a little odd, but I settled on:
  • pandoc -s --metadata title=immprep -o "output.html" -f docx "input.docx"
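
If you'd rather script that call than type it, here is a minimal Python sketch of the same command line. The function names are made up for this example, and it assumes pandoc is on your PATH (or that you pass the full path to pandoc.exe). Immprep itself is written in PureBasic, so this is not its actual code.

```python
import shutil
import subprocess


def build_pandoc_args(docx_path, html_path, pandoc="pandoc"):
    """Return the argument list for the pandoc call shown above."""
    return [
        pandoc,
        "-s",                           # standalone: write a complete HTML document
        "--metadata", "title=immprep",  # document title
        "-o", str(html_path),           # output file
        "-f", "docx",                   # input format
        str(docx_path),                 # input file
    ]


def convert_docx(docx_path, html_path, pandoc="pandoc"):
    """Run pandoc; raises CalledProcessError if the conversion fails."""
    if shutil.which(pandoc) is None:
        raise FileNotFoundError(f"pandoc not found: {pandoc}")
    subprocess.run(build_pandoc_args(docx_path, html_path, pandoc), check=True)
```

Calling convert_docx("input.docx", "output.html") should leave you with the same .html file as the command above.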

Pandoc converts the .docx file without any problems (it's simple, straightforward text after all: no indexes, only bold / italic, and a few headers) and it creates a UTF-8 .html file. When converting, pandoc (or maybe Word already) uses Unicode code points for the curly quotes (both single and double), em-dash, and ellipsis (three dots). In the UTF-8 file you'll find each of these 'special characters' as a block of three bytes.


Other programs may not always recognize these codes, and the conversion of em-dash to a Unicode code point instead of the usual html &amp;mdash; is a bit weird.
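
To see those three-byte blocks for yourself, a few lines of Python will do. This is just an illustration of how UTF-8 encodes these particular characters; the dictionary and function names are my own for this example.

```python
# The typographic characters mentioned above, by Unicode code point.
SPECIAL = {
    "left single quote": "\u2018",
    "right single quote": "\u2019",
    "left double quote": "\u201c",
    "right double quote": "\u201d",
    "em-dash": "\u2014",
    "ellipsis": "\u2026",
}


def utf8_bytes(ch):
    """Return the UTF-8 encoding of ch as a list of decimal byte values."""
    return list(ch.encode("utf-8"))


for name, ch in SPECIAL.items():
    print(f"{name:20} U+{ord(ch):04X} -> {utf8_bytes(ch)}")
```

Each character comes out as three bytes starting with 226 and 128 (hex E2 80); only the third byte differs. The left double quote, for example, is 226, 128, 156.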


ImmPrep

Processing converted .docx files

These are .docx files converted to .html using Pandoc. They contain curly quotes and other UTF-8 codes.

Immprep is written in PureBasic. PureBasic has decent UTF-8 and Unicode support, but unfortunately it wouldn't handle these codes properly: when reading and saving the .html files it made a mess of these special codes.

Solution: easy. As I am mostly interested in TTS and not in the 'curly' nature of the quotes, I just replace them with standard ASCII stuff. So now Immprep first strips out any of the troublesome UTF-8 codes and replaces them with regular 'straight' ASCII quotes, plus &amp;mdash; for the em-dash and a simple trinity of dots 😇 for the ellipsis. After that I do the regular html tag processing.
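
The idea can be sketched in a few lines of Python. This is not the PureBasic code inside Immprep; the replacement table and function names are assumptions made up for this example, covering only the characters mentioned in this post.

```python
from pathlib import Path

# Hypothetical replacement table; immprep's actual table is not shown in this post.
REPLACEMENTS = {
    "\u2018": "'",        # left single curly quote
    "\u2019": "'",        # right single curly quote / apostrophe
    "\u201c": '"',        # left double curly quote
    "\u201d": '"',        # right double curly quote
    "\u2014": "&mdash;",  # em-dash becomes the HTML entity
    "\u2026": "...",      # ellipsis becomes three plain dots
}


def flatten_typography(html):
    """Replace curly quotes, em-dashes and ellipses with plain equivalents."""
    for special, plain in REPLACEMENTS.items():
        html = html.replace(special, plain)
    return html


def flatten_file(path):
    """Rewrite a UTF-8 .html file in place with the replacements applied."""
    text = Path(path).read_text(encoding="utf-8")
    Path(path).write_text(flatten_typography(text), encoding="utf-8")
```

Reading and writing with an explicit encoding="utf-8" is the important bit here: it keeps the three-byte sequences together as single characters, so they can be matched and replaced cleanly.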


Processing exported Google Docs files

These are exported as a (zipped) .html file by Google Docs. They typically still contain 'fake' em-dashes using --.

Immprep takes those double hyphens -- and turns them into &amp;mdash; when processing the .html file.
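
A rough Python sketch of that replacement, assuming you only want to touch the text between tags and leave anything inside <...> (including html comments) alone. The names are made up for this example, and Immprep's own logic may well differ.

```python
import re

TAG = re.compile(r"(<[^>]*>)")  # capturing group keeps the tags in the split result


def double_hyphen_to_mdash(html):
    """Turn -- into &mdash; in text, leaving anything inside <...> untouched."""
    parts = TAG.split(html)
    # With one capturing group, even indexes are text and odd indexes are tags.
    for i in range(0, len(parts), 2):
        parts[i] = parts[i].replace("--", "&mdash;")
    return "".join(parts)
```

Two caveats with this simple approach: a run of three hyphens ends up as an entity followed by a single hyphen, and the contents of style or script blocks count as text, so a real tool would want to skip those.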


The command line parameters of Immprep have changed!

You can always list these by typing:

C:\> immprep help


Here is the content of the help file, showing the new command line parameters:


Immprep v1.04x - c2023 WackoWare:

  process files to simplify the use of immersive reader in Edge for TTS purposes


Processing .html files:

  immprep html <file> [noimmersive] [emdash] [mode <n>] [cleanup] [edge <path>] [trigger "<text>"]


  1. copy source .html file to immprep.html

  2. process immprep.html

  3. launch edge (optional)


Processing Google Docs .zip files:

  immprep gdocs <path> [noimmersive] [emdash] [anytime] [cleanup] [unread] [mode <n>] [edge <path>] [trigger "<text>"]


  1. look for latest zip file in specified folder

  2. unpack first *.html file from .zip to immprep.html

  3. process immprep.html

  4. clean up (remove zip, optional, careful with anytime)

  5. launch edge (optional)


Processing Word .docx files:

  immprep word <path> [pandoc <path>] [noimmersive] [emdash] [anytime] [cleanup] [mode <n>] [edge <path>] [trigger "<text>"]


  1. pandoc.exe must exist and be found, otherwise this will fail

  2. look for newest *.docx file in specified folder (consider using anytime)

  3. convert file to immprep.html.tmp using pandoc.exe

  4. process immprep.html

  5. launch edge (optional)


Options & Parameters: (either use all with, or all without a preceding dash)

  help             - shows this text, aborts any other command line processing

  version          - shows more detailed version information

  html <file>      - html source, process specified html file

  word <path>      - word source, process newest .docx file in specific path, requires pandoc.exe

  gdocs <path>     - gdocs source, process first .html in newest .zip file in specific path

  noimmersive      - don't look for <immersive> markers (see below)

  emdash           - any -- will be converted to &mdash;

  pandoc <path>    - specify path to pandoc.exe, default is nearby, immprep will fail if pandoc.exe is not found

  pandoc ~default~ - assume pandoc.exe is in the same folder as immprep.exe

  edge <path+file> - triggers the use of edge and specifies path to edge

  edge ~default~   - default is "C:\Program Files (x86)\Microsoft\Edge\Application\msedge.exe"

  anytime          - don't care how old the zip / word file is (otherwise it needs to be younger than 10 seconds)

  cleanup          - remove old .zip file

  unread           - will allow cleanup even if nothing was sent to edge (careful when combining with anytime!)

  mode <n>         - defines how the source file is processed and result is created (not yet implemented)

  trigger "<text>" - immprep will strip everything up to the specified text, then start reading at that point

  nowait           - ignore errors, just exit

  wait             - always wait before exiting (wait > nowait)


Note on Options: ('option flag enforced' error message)

  - if any option is preceded by a dash, then ALL options need to be preceded by a dash!

  - immprep version gdocs ~downloads~ mdash     - works

  - immprep version gdocs ~downloads~ -mdash    - will cause an error

  - immprep -version -gdocs ~downloads~ -mdash  - works

  - immprep -version -gdocs ~downloads~ mdash   - will cause an error


Replacements for <file> and <path>

  ~currentfolder~ - replaced with current directory

  ~downloads~     - replaced with c:\users\username\downloads

  ~nearby~        - replaced with folder containing immprep.exe

  ~programfolder~ - same as ~nearby~

  ~tempfolder~    - replaced with temp folder

  ~default~       - default paths hardcoded in immprep, may differ from yours, handle with care


Immersive:

  - when processing a file immprep looks for the specific trigger or the following keywords:

  - <immersive> <<immersive>> < immersive > << immersive >> mark the start of a section to be read

  - <endimmersive> and its variants mark the end of a section to be read.

  - you can combine <text> and <endimmersive> to read selective chapters

  - detection of these tags is suppressed by specifying the noimmersive parameter


Mode:

  0 - default

  x - not yet implemented


Return values:

  0 - Edge was opened

  1 - Edge was not opened


Call made to pandoc.exe:

  pandoc -s --metadata title=immprep -o "output.html" -f docx "input.docx"


Examples:

  immprep html test.html noimmersive mdash

  immprep html "test.html" trigger "chapter 1 - "

  immprep gdocs ~downloads~ cleanup mdash edge ~default~

  immprep word d:\documents\novel\ anytime edge "C:\Program Files (x86)\Microsoft\Edge\Application\msedge.exe" 


Example batch file for Google Docs (called via TouchPortal with Google Docs in focus):

  rem tp_immersive_gdocs.bat

  immprep gdocs ~downloads~ cleanup mdash edge ~default~

  if errorlevel 1 goto noedge

    nircmd wait 500

    nircmd sendkeypress ctrl+shift+u

  :noedge

  exit


TouchPortal script for Google Docs:

  Key Press Alt+Shift+F

  Wait for 500 msec

  Key Press Alt+Shift+D

  Wait for 500 msec

  Key Press Alt+Shift+H

  Run script & open tp_immersive_gdocs.bat


Example batch file for Word (called via TouchPortal with Word in focus):

  rem tp_immersive_word.bat

  immprep word d:\novel\ mdash edge ~default~ anytime

  if errorlevel 1 goto noedge

    nircmd wait 500

    nircmd sendkeypress ctrl+shift+u

  :noedge

  exit


TouchPortal script for Word:

  Key Press Alt+F

  Wait for 500 msec

  Key Press S

  Wait for 500 msec

  Run script & open tp_immersive_word.bat


Old style batch file:

  cd c:\users\username\downloads

  waitfile "zipfilename.zip" 4 exist

  If errorlevel 2 goto done

  If errorlevel 1 goto done

  :zipfound

    tar -xf "zipfilename.zip"

    immprep html "htmlfilename.html" mdash

    del "zipfilename.zip"

    "C:\Program Files (x86)\Microsoft\Edge\Application\msedge.exe" "c:\users\username\downloads\htmlfilename.html"

    waitwindow "htmlfilename*" 4

    timeout 1

    nircmd sendkeypress ctrl+shift+u

  :done

  exit


Suggested tools:

  edge           - Microsoft Edge is required for TTS

  nircmd         - Nirsoft multi-tool, here used to send a keypress to Microsoft Edge

  pandoc         - required when processing word .docx files

  touch portal   - use an old phone or tablet as your stream deck / macro keyboard

  tar            - Windows built-in unpacker (you already have this one)

  waitfile.exe   - halts batchfiles until a file (does not) exist (for more complex batch files)

  waitwindow.exe - halts batchfiles until a window is (not) open (for more complex batch files)

  cmdow          - allows manipulation of windows - note: once focus has been set using /ACT it cannot be changed again!


See also:

  the blog post at the following link shows how to use immprep, Touch Portal and nircmd to automate TTS proofreads:

  https://ninelizardsblog.blogspot.com/2023/03/launch-applications-from-touch-portal.html


Summary:

  immprep html|gdocs|word <path|path+file> [pandoc <path>] [noimmersive] [emdash] [anytime] [cleanup] [unread] [mode <n>] [edge <path+file>|~default~] [trigger "<text>"] [wait|nowait]

  with <path> = c:\|c:\path\abc.def|~nearby~|~current~|~downloads~|~default~|~tempfolder~


TIP!

If Edge starts to read in the wrong spot, put an << endimmersive >> somewhere near the end of your current chapter or the end of your document. That seems to help, though I've yet to understand why.

