
List formatting in Docbook -> PDF/PS



Hello, just dropping in to chat about the formatting of lists through the
Docbook -> PDF/PS transform. IIRC, we use jade to go from Docbook to HTML,
then use htmldoc to go from HTML to PS and PDF. Unfortunately, there's a
bug in this chain that ends up with list items having entirely too much
vertical whitespace in the PDF and PS output.

The root of the problem is that jade takes the Docbook source for a list as
follows:

<orderedlist>
<listitem>
<para>
...content...
</para>
</listitem>
</orderedlist>

and does the simplest possible transform to get to HTML, like:

<OL
><LI
><P
>...content...</P
></LI
></OL

See the <P></P> tags enclosing the listitem content? I think that's valid
HTML; browsers ignore the first <P> tag within listitem content. The
problem is that htmldoc doesn't. It sees <P>, says "Okay, insert
linefeed.", and we get ugly list formatting in our PDF and PS output, like:

 o 

   ...content...

I will send a note to the htmldoc support email address, explaining the
problem. Their web site states that they offer support for $99, so I don't
know if we will get anywhere with that. It's worth a shot.

I've taken a look at the htmldoc source but things get rather complicated
quickly; I don't think I've got the coding skills needed to put together a
good patch. If anyone else with C++ knowledge wants to try, the source is
available at http://www.easysw.com.

So, assuming that we won't get a fix for the problem at the right place in
a timely fashion, I sat down and crafted / threw together a Perl script to
preprocess the HTML before we feed it to htmldoc. You'll find it attached;
it's pretty simple, but it does the job, and I think you'll find the
results worthwhile. Feedback and more testing (I've only tested it on my
own DB2-HOWTO) are, of course, welcome.
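
For anyone who wants to see the substitution in isolation before running
the full script, here is a minimal sketch that applies the same pattern to
an in-memory sample. The sample string is my own reconstruction of the
jade-style output shown above, and strip_first_para is a hypothetical
helper name used only for this illustration; the attached script remains
the real thing.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper for illustration only: returns a copy of $text with
# the first <P>...</P> wrapper removed from each list item.
sub strip_first_para {
    my ($text) = @_;
    # jade puts each tag's closing ">" at the start of the next line, so
    # "<LI" is followed by a newline and then "><P". The lazy (.*?) stops
    # at the first "</P", and /s lets "." match across newlines.
    $text =~ s/(><LI\n)><P\n(.*?)<\/P\n>/$1$2\n/gs;
    return $text;
}

# Assumed sample of jade-style HTML for a one-item ordered list.
my $html = "<OL\n><LI\n><P\n>...content...</P\n></LI\n></OL\n>\n";

my $fixed = strip_first_para($html);
print $fixed;
```

Only the paragraph that immediately follows the opening list item tag is
unwrapped; any later paragraphs in the same item keep their <P> tags.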

-- 
Dan Scott,
Friend of the abnormal.
#!/usr/bin/perl -w

use strict;

sub checkPath($);
sub processHTML($);

my $manpage = '
SYNOPSIS 
   prehtml [SOURCE DIRECTORY] [DESTINATION DIRECTORY]

DESCRIPTION 
   Pre-processes HTML files from a specified directory and
   copies the output to a specified directory. This script
   only considers files with a ".html" extension to be HTML
   files; to change this, you currently have to change the
   value of the $extension variable in the Perl source.

   Run this script before running htmldoc.

   Removes first set of paragraph tags from list items
   in HTML generated from Docbook by jade to improve
   the formatting of PostScript and PDF produced by htmldoc.

   To make a long story short, jade uses absolutely correct HTML and
   wraps the content of each list item in a <P></P> tag. htmldoc
   dumbly treats this first <P></P> tag generically and inserts a
   linefeed into the PostScript and PDF versions of the documents.
   This script makes htmldoc work better.

   --help
      Display this help and exit.

AUTHOR
   Written by Dan Scott.

TODO
   Make stdout / stdin valid streams.

   Improve argument handling (add switches for source and
   destination directories).
  
   Let the user specify the $extension variable.

REPORTING BUGS
   Report bugs to <dan.scott@acm.org>. I will try to fix them,
   but I do have a real job...

COPYRIGHT
   Copyright (C) 2000 Dan Scott.
   This is free software. There is NO warranty; not even for
   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
';

my $extension;
my $sourcePath;
my $outPath;

$extension = ".html";

if (@ARGV != 2 || $ARGV[0] eq "--help" || $ARGV[1] eq "--help")
{
  print $manpage;
  exit(0);
}
else 
{
  $sourcePath = checkPath($ARGV[0]);
  $outPath = checkPath($ARGV[1]);
}

opendir(SOURCEDIR, $sourcePath) or die("Error: could not open directory $sourcePath: $!\n");
my @files = readdir SOURCEDIR;
closedir SOURCEDIR;

foreach my $sourceFile (@files)
{
  next if ($sourceFile =~ /^\.+$/ ); # Skip any string that is just dots
  next if ($sourceFile !~ /\Q$extension\E$/ ); # Skip any file that doesn't end in $extension
  # Use "or" rather than "||" so that die actually fires when open fails
  open(INFILE, "<", "$sourcePath$sourceFile") or die("Error: could not open $sourceFile for input: $!\n");
  local $/; # Enable slurp mode -- read the whole file at once
  my $htmlsource = <INFILE>;
  close INFILE;

  processHTML($htmlsource);

  open(OUTFILE, ">", "$outPath$sourceFile") or die("Error: could not open $sourceFile for output: $!\n");
  print OUTFILE $htmlsource;
  close OUTFILE;
}
exit(0);

sub checkPath($)
{
# checkPath ensures that a given path has a trailing "/" character.
  my $path = shift(@_);
  if (substr($path, -1, 1) ne "/")
  {
    $path = $path . "/";
  }
  return $path;
}

sub processHTML($)
# processHTML does the real work. It removes the first
# <P></P> tag in every list item element.
{
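  # jade puts each tag's closing ">" at the start of the next line, so the
  # pattern matches "<LI", a newline, then "><P". The lazy (.*?) stops at the
  # first "</P". Elements of @_ alias the caller's variable, so the
  # substitution edits $htmlsource in place.
  for (@_) { s/(><LI\n)><P\n(.*?)<\/P\n>/$1$2\n/gms }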
}