                            -= HTML2TEXT v1.20b =-


                                       
                                       
                               HTML2TEXT v1.20b
                                       
                                       
                           1997 (c) Gavin Spearhead
                                       

-----------------------------------------------------------------------------

   I. What is it?
      
      HTML2TEXT is a utility that converts HTML files to plain text.
      Optionally it also tries to figure out if the HTML file is
      well-constructed.
      
      All Rights Reserved
      
      Permission to use, copy, and distribute this software and its
      documentation for any purpose and without fee is hereby granted,
      provided that the above copyright notice appear in all copies and that
      both that copyright notice and this permission notice appear in
      supporting documentation, and that the name Gavin Spearhead not be used
      in advertising or publicity pertaining to distribution of the software
      without specific, written prior permission.
      
      
                                 *** DISCLAIMER ***
      
      GAVIN SPEARHEAD DISCLAIMS ALL WARRANTIES WITH REGARD TO THIS SOFTWARE,
      INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS, IN NO
      EVENT SHALL GAVIN SPEARHEAD BE LIABLE FOR ANY SPECIAL, INDIRECT OR
      CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF
      USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR
      OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR
      PERFORMANCE OF THIS SOFTWARE.
      
      
                                        ***
      
      Any bugs, errors, or suggestions should be sent to the author. Reports
      of unsupported HTML tags or amp-codes are also welcome, along with a
      description of the tag or code, its restrictions and its options.
      
      You are encouraged to register this software. Once registered, you will
      either receive new versions when they are released or a note that a new
      version has been released. It also gives me an idea of how many people
      use this program and how it is spread. There are three ways to register:
         i. Start your web-browser and fill in the form 
         ii. Convert register.htm to a text file, edit it to fill in the
            entries and email it to me 
         iii. Same as above but send it to my postal address 
      Note that registration is Free of charge!
      
      Once you are registered you will receive a registration key, so that
      your name is displayed when you run the program. The key is a file that
      will be sent via email if possible. This is currently the only way to
      receive the registration key.
      
      
   II. Which files are contained in the package?
      
      
      HTML2TXT.EXE  The executable                               
      HTML2TXT.CFG  Configuration file with options
      HTML2TXT.INI  Ini-file with amp strings                    
      HTML2TXT.HTM  Documentation for HTML2TEXT in HTML format   
      HTML2TXT.TXT  Documentation for HTML2TEXT in text format   
      REGISTER.HTM  Registration form in HTML format             
      
      If one of the files is missing, throw the package away and ask the
      author for a new complete copy. The address is at the end of the file.
      
      
   III. How to start it?
      
      Type at the commandline:
      
      
  HTML2TXT <filespecification> <options>
      
      <filespecification> is the name of the file or files to convert; it may
      include wildcards and may appear more than once on the command line.
      Note that long filenames (Windows 95) are not supported. This means that
      input filenames have to be in the 8.3 format (every Windows 95 file has
      an 8.3 filename and optionally a long filename). The output file will
      also have an 8.3 filename.
      
      <options> can be any of the following:
      
      
      -x   Errors are marked in the outputfile by [XXX <error> ]           
      -w   Warn for HTML-errors in sourcefile                              
      -s   Write output to standard output                                 
      -e   Display the title in the output file                            
      -a   Display the alternative text for an image                       
      -l   Write the link to output file                                   
      -i   Write user input fields and buttons to output file              
      -t   Reformat tables                                                 
      -q   Treat balancing of quotes strictly                              
      -h   Display a help screen (same as -?)
      -?   Display a help screen (same as -h)
      -o   Controls the overwriting policy for existing files. The suffixed
           characters have the following meaning:
               * A: Always overwrite                                       
               * V: Never overwrite                                        
               * D: Always append                                          
                                                                           
                                                                           
      -r   Controls the line wrapping policy:
           If suffixed by a number, lines will be at most that many
           characters long. If suffixed by a '-', no line wrapping takes
           place. If no suffix is given, lines are wrapped according to
           the screen width (usually 80 characters).
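
      The three forms of the -r option can be sketched as follows. This is
      only an illustration in Python, not the program's own code; the name
      parse_wrap_option and the fixed default of 80 are assumptions made
      for the example.

```python
DEFAULT_WIDTH = 80  # assumed screen width; the text above says "usually 80"


def parse_wrap_option(option, screen_width=DEFAULT_WIDTH):
    """Return the maximum line length, or None when wrapping is off.

    Hypothetical helper: it only mirrors the three cases described for -r.
    """
    if not option.startswith("-r"):
        raise ValueError("not a -r option: " + option)
    suffix = option[2:]
    if suffix == "":
        return screen_width      # no suffix: wrap at the screen width
    if suffix == "-":
        return None              # '-' suffix: no line wrapping
    if suffix.isascii() and suffix.isdigit():
        return int(suffix)       # numeric suffix: maximum line length
    raise ValueError("illegal -r suffix: " + suffix)
```

      For example, parse_wrap_option("-r72") gives 72, while
      parse_wrap_option("-r-") gives None.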
      
      The result file will have the same name as the original file, but with
      the extension specified in the config-file (default is .txt). If the
      original file already has that extension, the result file will always
      get the extension '.tx1' instead.
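
      The naming rule for result files can be sketched like this in Python.
      The function name output_name is hypothetical, and comparing the
      extensions without regard to case is an assumption (DOS filenames are
      not case sensitive); the real program may differ in detail.

```python
import os


def output_name(input_name, out_ext="txt"):
    """Derive the result file name for one input file.

    out_ext stands in for the extension set in the config-file.
    """
    base, ext = os.path.splitext(input_name)
    # Assumption: extensions are compared case-insensitively.
    if ext[1:].lower() == out_ext.lower():
        return base + ".tx1"
    return base + "." + out_ext
```

      So INDEX.HTM would become INDEX.txt, while NOTES.TXT would become
      NOTES.tx1.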
      
      All messages are written to stderr.
      
      
   IV. What do the files HTML2TXT.INI & HTML2TXT.CFG do?
      
      
          * HTML2TXT.INI
            
            This file contains the translation table for ampersand sequences,
            i.e. sequences of characters of the form &<some_text>;. Each line
            has the following format:
                <identifier>="<result>"
            
            Here <identifier> is the text between the '&' and the ';', and
            <result> is the text that will replace the amp-sequence. The
            quotes are optional. The text can contain escape sequences
            (C-style) of the format \<char>, where <char> can be:
                # s : a space 
                # n : a newline 
            
            or of the format \<number>, in which case the character whose
            ASCII value equals <number> is inserted. Every other character is
            inserted literally, including quotes and backslashes.
            
            Amp-codes of the format &#nnn; that are not specified in the
            config-file will be converted to the character with ASCII value
            nnn.
            
            
          * HTML2TXT.CFG This file contains the various options that can be
            set. These are all commented in the file itself.
            
            Beware that some options have side effects, e.g. turning off line
            wrapping also means that text will not be centered.
            
            In both files, any line starting with a semicolon is treated as a
            comment and thus ignored. 
            
            Both files are searched for in the current directory first and
            then in the directory from which HTML2TEXT was started. Usually
            these files are placed in the same directory as HTML2TXT.EXE, a
            directory in your path.
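
            As an illustration of the ini-file format described above, here
            is a small Python sketch. It is not the program's own parser.
            The function names are hypothetical, and two points are
            assumptions: \<number> is read as a decimal value, and a
            backslash that does not start one of the listed escapes is
            kept as it is.

```python
import re

# Escapes named in the text: \s, \n and \<number>.
_ESCAPE = re.compile(r"\\(s|n|[0-9]+)")


def _unescape(match):
    code = match.group(1)
    if code == "s":
        return " "
    if code == "n":
        return "\n"
    return chr(int(code))  # assumption: <number> is decimal


def parse_ini_line(line):
    """Return (identifier, result), or None for comments and blank lines."""
    line = line.strip()
    if not line or line.startswith(";"):
        return None
    identifier, sep, result = line.partition("=")
    if not sep:
        raise ValueError("missing '=' in: " + line)
    result = result.strip()
    # The surrounding quotes are optional.
    if len(result) >= 2 and result[0] == '"' and result[-1] == '"':
        result = result[1:-1]
    return identifier.strip(), _ESCAPE.sub(_unescape, result)


def expand_amp(code, table):
    """Expand the text between '&' and ';'.

    Codes not found in the table fall back to the numeric form #nnn.
    Returns None for an unknown code.
    """
    if code in table:
        return table[code]
    if re.fullmatch(r"#[0-9]+", code):
        return chr(int(code[1:]))
    return None
```

            With this sketch a line such as nbsp="\s" yields the pair
            ("nbsp", " "), and expand_amp("#65", {}) yields "A".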
            
            
      
   V. What does it do?
      
      HTML2TEXT converts HyperText Mark-up Language (HTML) files to plain-text
      (ASCII) files. The following rules apply:
          * The title is displayed on the first line of the screen 
          * Ampersand codes (&...;) are converted to character sequences
            according to the input file or pre-programmed characters. 
          * Tags perform their task according to the HTML specification as
            well as possible; note that some tags cannot have any output in
            plain ASCII text files (e.g. blinking). 
          * Newlines and tabs are converted to spaces, and redundant spaces
            are removed. 
          * Lines are written and justified according to the settings; lines
            that are too long are wrapped at word boundaries. 
          * Warnings are generated for ill-constructed tags or amp-codes 
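
      The whitespace and line wrapping rules from the list above can be
      sketched in Python with the standard textwrap module. This only
      illustrates the idea; the function names are made up for the example
      and HTML2TEXT itself may handle edge cases differently (textwrap, for
      instance, splits a single word that is longer than the line).

```python
import textwrap


def collapse_whitespace(text):
    """Turn newlines and tabs into spaces and drop redundant spaces."""
    return " ".join(text.split())


def wrap_words(text, width=80):
    """Break a paragraph at word boundaries so no line exceeds width."""
    return textwrap.wrap(collapse_whitespace(text), width=width)
```

      For example, wrap_words("one  two\nthree\tfour", width=9) returns the
      lines "one two", "three" and "four".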
      
   VI. Which tags does it recognise?
      
      
      Tag         What it does in HTML2TEXT                                 
      A           Checks unless <a name=...>, optionally a [ Name ] or [    
                  Link ] is written                                         
      ADDRESS     See I                                                     
      APPLET      Checks, ignores text between <APPLET></APPLET>            
      AREA        Ignore                                                    
      B           Checks, Optionally writes BOLD-token                      
      BASE        Ignores                                                   
      BASEFONT    Ignores                                                   
      BGSOUND     Ignores                                                   
      BIG         Checks                                                    
      BLINK       Checks (Does this really make the text blink???)          
      BLOCKQUOTE  Checks, indents                                           
      BODY        Checks                                                    
      BR          Writes a newline                                          
      CAPTION     Ignore                                                    
      CENTER      Checks, centers when linewrap is on                       
      CITE        see I                                                     
      CODE        See Pre                                                   
      COMMENT     Ignores anything between <COMMENT></COMMENT>, Checks      
      DD          Inserts newline and indents                               
      DFN         See I                                                     
      DIR         See OL                                                    
      DIV         Checks, writes a newline at both open and close tag (Is   
                  this correct??)                                           
      DL          Starts a definition list, Checks                          
      DT          Inserts a newline                                         
      EM          see I                                                     
      EMBED       Ignore                                                    
      FRAME       Ignore                                                    
      FRAMESET    Checks                                                    
      FONT        Checks                                                    
      FORM        Checks                                                    
      H           Checks                                                    
      HD1         Writes the text to screen with surrounding newlines
      HD2                                                                   
      HD3                                                                   
      HD4                                                                   
      HD5                                                                   
      HD6                                                                   
      HEAD        Checks                                                    
      HR          Writes a line of '='s if size > 3, otherwise a line of
                  '-'s. The length is absolute or relative, according to
                  the width value.
      HTML        Everything after </HTML> is ignored, Checks               
      I           Checks, Optionally writes ITALIC-token                    
      IMG         Ignored, Optionally writes 'alt' text                     
      INPUT       Ignored                                                   
      ISINDEX     Write a prompt plus optionally [ Input ]                  
      KBD         See B                                                     
      LI          Writes a listelement identifier, for ULs * or specified   
                  in config-file, for OL a number, parameter type and value 
                  used                                                      
      LINK        Ignored                                                   
      LISTING     See Pre                                                   
      MAP         Checks                                                    
      MARQUEE     Checks                                                    
      MENU        See OL                                                    
      META        Ignored                                                   
      NEXTID      Ignores                                                   
      NOBR        Checks                                                    
      NOFRAMES    Checks                                                    
      OL          An ordered list, Checks, type parameter used              
      OPTION      Ignore                                                    
      P           Starts a new paragraph                                    
      PRE         Outputted as is, Checks (line wrap is not ignored, if on) 
                                                                            
      S           see strike                                                
      SAMP        Checks                                                    
      SCRIPT      Ignores anything between <SCRIPT></SCRIPT>, Checks        
      SELECT      Checks                                                    
      SMALL       Checks                                                    
      SOUND       Ignores                                                   
      STRIKE      Checks                                                    
      STRONG      See B                                                     
      SUB         Checks                                                    
      SUP         Checks                                                    
      TABLE       Checks, starts/finishes a table                           
      TD          Defines a table cell                                      
      TEXTAREA    Ignored                                                   
      TITLE       Writes the title, if within <HEAD></HEAD>, Checks         
      TH          Defines a table header cell                               
      TR          Defines a table row                                       
      TT          Checks                                                    
      U           Checks                                                    
      UL          An unordered list, Checks, type parameter used            
      VAR         See Pre                                                   
      WBR         Ignores                                                   
      !DOCTYPE    Ignores                                                   
      
      Here "Checks" means that for every opening tag a matching closing tag is
      sought. In most cases the order of the closing tags is not relevant, but
      in rare cases the output will be unexpected.
      
      Here "Ignores" means that the tag is simply ignored; no output is
      generated.
      
      Some tags have optional closing tags; these are ignored and not
      checked, e.g. <tr>, <td>, <th>, <p>. Some tags need a closing tag only
      in certain forms (e.g. <a name=...> does not); then only the forms that
      do need one will be checked. Note that this just specifies the actions
      taken by HTML2TEXT and not what the HTML specification says.
      
      Some tags need a closing tag (preceded by a slash). For these it is
      checked that the tag was opened before and that tags are closed in the
      right order. It is also checked that tags are not nested unnecessarily
      (e.g. bold inside bold); usually this indicates a missing slash in the
      second tag. Lots of tags are simply ignored and thus generate no
      output. Some tags optionally generate output. Any text after </html> is
      ignored. Some tags cause the text that follows them to be ignored.
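
      The kind of checking described above (matching closing tags, closing
      order, needless nesting) can be sketched with a simple stack. This
      Python fragment is an illustration only: the tag sets CHECKED and
      NO_NESTING are small hypothetical subsets, and the messages are not
      the warnings that HTML2TEXT itself prints.

```python
import re

TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>")

# Hypothetical subsets chosen for the example.
CHECKED = {"b", "i", "blockquote", "center", "ul", "ol", "pre", "table"}
NO_NESTING = {"b", "i"}


def check_tags(html):
    """Return a list of messages about unbalanced checked tags."""
    messages = []
    open_tags = []  # stack of currently open checked tags
    for match in TAG.finditer(html):
        closing = match.group(1) == "/"
        name = match.group(2).lower()
        if name not in CHECKED:
            continue  # unchecked tags are skipped in this sketch
        if not closing:
            if name in NO_NESTING and name in open_tags:
                messages.append("<%s> nested inside <%s>; missing slash?"
                                % (name, name))
            open_tags.append(name)
        elif name not in open_tags:
            messages.append("</%s> without matching <%s>" % (name, name))
        else:
            if open_tags[-1] != name:
                messages.append("</%s> closed out of order" % name)
            # Remove the most recently opened tag with this name.
            index = len(open_tags) - 1 - open_tags[::-1].index(name)
            del open_tags[index]
    for name in open_tags:
        messages.append("<%s> never closed" % name)
    return messages
```

      For instance, check_tags("<b>x<i>y</b></i>") reports that </b> was
      closed out of order, and check_tags("<b>ok</b>") reports nothing.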
      
      Unknown tags are ignored and optionally a message is generated.
      
      Tables generate the following output. Every table row is written on at
      least one line, and every row ends with a linefeed. Table columns are
      separated by at least one space (no boxes or anything). Options are
      implemented for tables, but currently do not work very well: a row can
      only be affected by at most one rowspan and one colspan. Also, text
      won't be stretched to the full length of cells with rowspans; the
      surrounding cells will be empty instead. Tables are squeezed to a
      minimum size if linewrap is chosen. Otherwise each cell will be as wide
      as the longest cell in its column. For large tables, check the
      config-file for parameters that let those be handled well too. (Who
      uses tables larger than 256 x 10 with cells of more than 64 KB? However,
      some people build their whole pages in a table...) If you do, you will
      have to increase max_rows and max_cols in the config-file. If this is
      necessary, the most likely error messages are error 13 and error 14;
      error 7 and error 12 are also possible. Except for error 12 these are
      fatal errors. Occasionally this may also lead to a situation in which
      your machine seems not to respond. Nested tables aren't supported
      either; they will be treated as if the notables option is set to on.
      Only the outer table will be formatted.
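
      The basic table layout without linewrap (one line per row, columns
      separated by a space, every cell as wide as the longest cell in its
      column) can be sketched in Python as below. The function format_table
      is hypothetical and leaves out rowspan, colspan, captions and the
      squeezing that is done when linewrap is on.

```python
def format_table(rows):
    """Lay out a table given as a list of rows of cell strings."""
    n_cols = max((len(row) for row in rows), default=0)
    widths = [0] * n_cols
    for row in rows:
        for col, cell in enumerate(row):
            widths[col] = max(widths[col], len(cell))
    lines = []
    for row in rows:
        # Short rows are padded with empty cells.
        padded = list(row) + [""] * (n_cols - len(row))
        cells = [cell.ljust(widths[col]) for col, cell in enumerate(padded)]
        lines.append(" ".join(cells).rstrip())
    return "\n".join(lines)
```

      Called with the rows ["Tag", "Action"] and ["BR", "Writes a newline"],
      it pads "BR" to the width of "Tag" so that the second column lines up.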
      
      
   VII. What does it output?
      
      HTML2TEXT can have two kinds of output: 
         1. It can just throw all output to standard output. This means
            that all files specified are concatenated to stdout. This also
            means that one can pipe or redirect the output directly.
            
            
         2. It can create files with the extension specified in the
            config-file (or 'tx1' if the input file already has that
            extension). If the output file already exists, the user is asked
            to confirm overwriting it.
            
            
      
      Note that all messages are written to standard error. This is because
      one needs to make a distinction between the converted text and the
      additional information output by HTML2TEXT. Thus messages are written
      to the screen even if stdout is redirected. Standard error can be
      redirected as well (although command.com does not support this). The
      hush option also prevents output to stderr.
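
      The split between converted text and messages can be pictured with
      the following Python sketch. The function emit and its hush parameter
      are made up for the illustration; they only model the behaviour
      described above and are not part of HTML2TEXT.

```python
import sys


def emit(converted_text, messages, out=None, err=None, hush=False):
    """Write converted text to one stream and messages to another."""
    out = sys.stdout if out is None else out
    err = sys.stderr if err is None else err
    out.write(converted_text)
    if not hush:
        # Messages never mix with the converted text.
        for message in messages:
            err.write(message + "\n")
```

      Because the two streams are separate, redirecting standard output to
      a file captures only the converted text.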
      
      
   VIII. Errors and Warnings
      
      Errors
      
      
      Error  Error string                   Description                     
      1      Illegal parameter              A command line parameter was    
                                            not recognised                  
      2      No such file                   No file was found matching the  
                                            file name specification         
      3      No filename specified          No file specification was found 
                                            on the command line             
      4      Config-file not found          The program could not locate    
                                            the config-file, which is       
                                            usually found in the current    
                                            directory or in the directory   
                                            containing html2txt.exe         
      5      Ini-file not found             The program could not locate    
                                            the ini-file (see error 4)      
      6      Error in ini-file              One entry in the ini-file       
                                            contains an illegal value       
      7      Not enough memory              There wasn't enough memory to   
                                            execute the program             
      8      File couldn't be opened        One file could not be found or  
                                            opened                          
      9      Error in config-file           One entry in the config-file    
                                            contains an illegal value       
      10     To many amp-codes in ini-file  The ini-file contains too many  
                                            codes to hold in memory         
      11     File skipped                   The file couldn't be converted  
      12     Heap corrupted                 Memory is being corrupted       
                                            during the conversion           
      13     Too many rows in table         The table contains more rows    
                                            than the program can keep in    
                                            memory                          
      14     Too many columns in table      The table contains more columns 
                                            than the program can keep in    
                                            memory                          
      15     Specified path is illegal      The ini-file couldn't be found  
                                            in the specified path           
      16     Could not create temporary     There isn't enough space on the 
             file                           disk or there aren't enough
                                            handles to open a temporary     
                                            file                            
      17     File writing error: Disk full  An error occurred while writing
                                            a file; most likely the disk
                                            is full
      Warnings
      
      
      Warning  Warning String                Description                   
      256      Unrecognised HTML-code        The HTML-code was not         
                                             recognised, probably not      
                                             defined                       
      257      Ill-contructed HMTL-code      The HTML-code was different   
                                             from the one expected,        
                                             probably the order of codes
                                             is swapped                    
      258      Illegal list item             The list item or list was of  
                                             an illegal type               
      259      Semicolon expected            The semicolon after a &...    
                                             sequence is missing           
      260      Illegal token                 A token was encountered which 
                                             wasn't legal in the context   
      261      Ill-constructed amp code      The amp code is not defined   
                                             in the ini-file               
      262      Misplaced <title>             The title appeared outside of 
                                             the head section              
      263      HTML-tag starts with space    The HTML tag starts with a
                                             space character               
      264      Invalid list type             The type specified for a list 
                                             item or a list was illegal    
      265      Unexpected '>' encountered    A greater-than token was      
                                             encountered without a matching
                                             less-than token
      266      LI without list               An LI tag was encountered     
                                             outside a list section        
      267      DD without definition list    A DD tag was encountered
                                             outside a definition list
      268      DT without definition list    A DT tag was encountered
                                             outside a definition list
      269      Tables within tables not      A table section within a      
               supported                     table was encountered         
      270      Table cell truncated          A table cell contained more   
                                             than 65K data                 
      
   IX. Problems & Open Issues
      
      
          * Forms
            Forms may look a bit silly when they are wider than the maximum
            width of a line, especially within a table.
            
            
          * Tables
            Tables are reformatted in a rather simplified manner (see above).
            Captions are printed without regard to alignment.
            Future editions will have improved table processing.
            Large tables may produce erroneous output (one cell may not
            contain more than 65k bytes). Too many columns in a table will
            terminate the program. 
          * Too many newlines
            Some combinations of tags will result in too many newlines.
            
            
      
   X. How to obtain a new copy of HTML2TEXT?
      
      There are several ways to obtain a copy of html2text: 
         1. Download it here: www.noord.bart.nl/~wieger1/h2t120b.zip
            
            
         2. Write an email to me and ask. I will return the latest copy
            attached to the reply as a self-extracting archive.
            
            
         3. Write to my postal address and ask. Be sure to enclose enough
            (Dutch) money to cover the postage. Also include the return
            address. For Dutch return addresses include a SASE (with enough
            postage).
            
            
         4. Look on the nearest BBS or internet site for a copy.
            
            
      
   XI. What's left to work out?
      
      
          * Better table formatting, including the alignment, rowspan and
            colspan options; alignment of captions isn't supported yet either. 
          * Windows 95 Filenames 
          * Improve Form processing 
      
   XII. What changes were made?
      
      
      1.20a & 1.20b                                                         
                         * Bugs fixed and internal revisions
                         * Table reformatting improved (no more garbage
                           output)
                         * Multiple file specifications on commandline
                         * Added an option so that no output is generated
                           (-h)
                                                                            
      1.10                                                                  
                         * Fixed some bugs                                  
                         * Improved commandline options parsing             
                                                                            
      1.02                                                                  
                         * Added formatting of tables.                      
                         * Better processing of user input (forms)          
                         * Line wrapping added internally (WORDWRAP.EXE is
                           not necessary anymore)
                         * Option added to set line wrap
                         * WORDWRAP package not included anymore
                         * Amp sequences of the format &#nnn; can now be
                           defined, but can be ignored.
                         * A very nasty bug fixed which corrupted the heap  
                         * HTML warnings optional                           
                         * Added and removed commandline options            
                                                                            
      1.01                                                                  
                         * Fixed a nasty bug: when outputting to STDOUT, the
                           # of errors was displayed before the last text.
                         * More intelligent algorithm for performing simple
                           text
                         * Added value and type options to <LI> tag         
                         * Added option for title                           
                         * Decreased default value for stack and tag size   
                         * Documentation converted to HTML format           
                                                                            
      1.00                                                                  
                         * First version, not released                      
                                                                            
      
      
   XIII. How to reach the author?
      
      Write email to:
      wieger1@noord.bart.nl
      or
      schotanu@cs.utwente.nl
      
      Write to:
      Gavin Spearhead
      Witbreuksweg 387-302
      7522 ZA Enschede
      The Netherlands
      
      The latest version of this file can be found at
      www.noord.bart.nl/~wieger1/html2txt.htm 

-----------------------------------------------------------------------------

