NAME
doc2txt, doc2ps, wdoc2txt, xls2txt, olefs, mswordstrings, msexceltables
– extract printable text from Microsoft documents
SYNOPSIS
doc2txt [ file.doc ]
doc2ps [ file.doc ]
wdoc2txt [ file.doc ]
xls2txt [ file.xls ]
aux/olefs [ -m mtpt ] file.doc
aux/mswordstrings mtpt/WordDocument
aux/msexceltables [ -qaDnt ] [ -d delim ] [ -c column-range ] [ -w worksheet-range ] mtpt/Workbook
DESCRIPTION
Doc2txt is an rc(1) script that uses olefs and mswordstrings to
extract the printable text from the body of a Microsoft Word document
and write it on the standard output. Doc2ps is similar, but emits
PostScript corresponding to the document. Wdoc2txt is similar
to doc2txt, but uses plumb(1) to send the output to a
new acme(1) window instead. Xls2txt performs a similar function
for Microsoft Excel documents.
Microsoft Office documents are stored in OLE (Object Linking and
Embedding) format, which is a scaled-down version of Microsoft's
FAT file system. Olefs presents the contents of an MS Office document
as a file system on mtpt, which defaults to /mnt/doc. Mswordstrings
or msexceltables may then be used to parse
the files inside, extracting a text stream. Msexceltables may
be given options to control the formatting of its output.
EXAMPLE
Extract pieces of an MS Excel spreadsheet.
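The following is a minimal sketch of the steps described above, not the page's original example. It assumes a spreadsheet named file.xls (a hypothetical name), the default mount point /mnt/doc, and that the argument to -d is used as the field delimiter in the output. The syntax of the ranges accepted by -c and -w is not given here, so those options are left out.

```rc
# mount the OLE container; /mnt/doc is the documented default mount point
aux/olefs -m /mnt/doc file.xls
# print the tables found in the Workbook file; using ',' as the delimiter is an assumption
aux/msexceltables -d ',' /mnt/doc/Workbook
# remove the mount when finished
unmount /mnt/doc
```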
SOURCE
/rc/bin             doc2txt, doc2ps, wdoc2txt, and xls2txt
/sys/src/cmd/aux    the others
SEE ALSO
strings(1)
``Microsoft Word 97 Binary File Format'', at Microsoft's developer (MSDN) home page.
``LAOLA Binary Structures'', http://user.cs.tu-berlin.de/~schwartz/pmh
``OpenOffice.Org's Excel Documentation'', http://sc.openoffice.org/excelfileformat.pdf