Plan 9 from Bell Labs’s /usr/web/sources/patch/applied/doc-man-1-doc2txt-wdoc2txt/doc2txt

Copyright © 2021 Plan 9 Foundation.
Distributed under the MIT License.

.TH DOC2TXT 1
.SH NAME
doc2txt, wdoc2txt, xls2txt, olefs, mswordstrings, msexceltables \- extract printable strings from Microsoft Office documents
.SH SYNOPSIS
.B doc2txt
[
.I file.doc
]
.br
.B wdoc2txt
[
.I file.doc
]
.br
.B xls2txt
[
.I file.xls
]
.br
.B aux/olefs
[
.B -m
.I mtpt
]
.I file.doc
.br
.B aux/mswordstrings 
.I /mnt/doc/WordDocument
.br
.B aux/msexceltables
[
.B -aDnt
] [
.B -d
.I delim
] [
.B -w
.I worksheet-range
]
.I /mnt/doc/Workbook
.SH DESCRIPTION
.I Doc2txt
is an
.IR rc (1)
script that uses 
.I olefs
and
.I mswordstrings
to extract the printable text from the body of a Microsoft Word document and write it on the standard output.
.I Wdoc2txt
is similar, but uses
.IR plumb (1)
to send the output to a new
.IR acme (1)
window instead.
.I Xls2txt
performs a similar function for Microsoft Excel documents.
.PP
Microsoft Office documents are stored in OLE (Object Linking and Embedding)
format, which is a scaled-down version of Microsoft's FAT file system.
.I Olefs
presents the contents of an Office document as a file system
on
.IR mtpt ,
which defaults to
.BR /mnt/doc .
.I Mswordstrings
or
.I msexceltables
may then be used to parse the files inside, extracting
a text stream.
.I Msexceltables
may be given options to control the formatting of its output.
.TP
.B -n
Disables field padding to column width.
.TP
.B -t
Truncates fields to the column width.
.TP
.B -a
Attempts conversion of non-tabular sheets (charts) in the workbook.
.TP
.BI -d " delim
Sets the interfield delimiter to the string
.IR delim ,
by default a single space.
.TP
.B -D
Enables debugging output.
.TP
.BI -w " worksheet-range
Specifies which worksheets to process; by default all tabular sheets are
output.
Suppressed chart pages are always included in the sheet count.
Arbitrary lists of pages or page ranges may be given: individual pages
are separated by commas, and the two ends of a sheet range are separated by a minus sign.
.SH EXAMPLE
.EX
	aux/olefs report.xls
	aux/msexceltables -w 1,7,9-14,3-4 -n -d '@' /mnt/doc/Workbook
	unmount /mnt/doc
.EE
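.PP
The steps that
.I doc2txt
automates for a Word document can also be run by hand.
The following is a sketch only, built from the synopsis above; it assumes a hypothetical file named
.B report.doc
and the default mount point, and the script itself may differ in detail.
.EX
	aux/olefs -m /mnt/doc report.doc
	aux/mswordstrings /mnt/doc/WordDocument
	unmount /mnt/doc
.EE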
.SH SOURCE
.B /rc/bin/doc2txt
.br
.B /rc/bin/wdoc2txt
.br
.B /rc/bin/xls2txt
.br
.B /sys/src/cmd/aux/msexceltables.c
.br
.B /sys/src/cmd/aux/mswordstrings.c
.br
.B /sys/src/cmd/aux/olefs.c
.SH SEE ALSO
.IR strings (1)
.br
``Microsoft Word 97 Binary File Format'',
available online at Microsoft's developer home page.
.br
``LAOLA Binary Structures'', 
.I http://snake.cs.tu-berlin.de:8081/~schwartz/pmh 
.br
``OpenOffice.Org's Excel Documentation'',
.I http://sc.openoffice.org/excelfileformat.pdf
