Plan 9 from Bell Labs’s /usr/web/sources/contrib/steve/root/sys/lib/texmf/tex/latex/base/cyrguide.tex

Copyright © 2021 Plan 9 Foundation.
Distributed under the MIT License.
Download the Plan 9 distribution.


% \iffalse meta-comment
%
% Copyright 1993 1994 1995 1996 1997 1998 1999
% The LaTeX3 Project and any individual authors listed elsewhere
% in this file. 
% 
% This file is part of the LaTeX2e system.
% ----------------------------------------
% 
% It may be distributed under the terms of the LaTeX Project Public
% License, as described in lppl.txt in the base LaTeX distribution.
% Either version 1.0 or, at your option, any later version.
% 
% \fi
\NeedsTeXFormat{LaTeX2e}[1995/12/01]

\documentclass{ltxguide}[1999/02/28]

\title{Cyrillic languages support in \LaTeX}

\author{\copyright~Copyright 1998--1999,\\ Vladimir Volovich,
        Werner Lemberg and \LaTeX3 Project Team.\\ All rights reserved.}

\date{12 March 1999}

\begin{document}

\maketitle
\tableofcontents

\begin{abstract}
  This document contains basic information on the Cyrillic setup for
  \LaTeX{}: how to get the fonts, how to set them up, how to use
  the interface, its interaction with \babel{}, etc. This is only a first
  draft of the document and it will probably be modified in future; so
  please send in comments on it via the \texttt{latexbug} system
  (see below).
\end{abstract}


\section{Introduction}

Most Latin-based European languages were supported in \LaTeX{} by
introducing the~|T1|~font encoding and by using the \textsf{fontenc}
and \textsf{inputenc} packages; these use only standard \TeX{} means
to support any \mbox{8-bit} input encoding and this one standard font
encoding.  The restriction to a single font encoding guarantees that
multiple languages can happily coexist in one document (\eg
hyphenation will be correct for all languages).

Starting with the December~1998 Release, \LaTeX{} finally supports
Cyrillic languages.  This support is based on the new standard
Cyrillic \TeX{} font encodings---|T2A|, |T2B|, |T2C|, and~|X2|.  The
first three of these satisfy some basic requirements for
\LaTeX{}~|T*|~encodings, and thus can be used in multi-lingual documents
with other languages based on standard font encodings.

The reason why we need four different Cyrillic font encodings is that
these font encodings support \emph{all} the Cyrillic languages that
have been used during the twentieth century (see
Section~\ref{fontencs})!  The number of Cyrillic glyphs is large, so
they cannot be represented with 128~character slots; the other (lower)
128~slots are reserved for Latin letters and other invariant symbols
that are needed for the encoding to be a conformant
\LaTeX{}~\texttt{T}~encoding.

There are some glyphs in the |T2*|~encodings which do not yet have
associated characters in \emph{Unicode}, the world-wide character
standard.  Also, one more font encoding, |T2D|,~is planned for a
forthcoming release of \LaTeX{}.  A lot of Cyrillic input encodings
are already supported (see Section~\ref{inputencs}), and additional
encodings could be added easily.


\subsection{Acknowledgments}
\label{sec:acks}

The work on |T2*|~encodings was carried out by the T2~Team, led by
Alexander Berdnikov (other members are Mikhail Kolodin and Andrew
Janishewskii).  The LH~fonts were produced by Olga Lapko (with
A.~Khodulev).  The \textsf{T2} bundle and \textsf{ruhyphen} package
were written by Werner Lemberg and Vladimir Volovich (except that the
concrete hyphenation patterns which are part of \textsf{ruhyphen} came
from individual authors).  The support for the Ukrainian language was
prepared by Andrij Shvaika.
  

\section{Installation}

The \textsf{fontenc} and \textsf{inputenc} packages are installed
automatically in every base \LaTeX{} distribution.

All the necessary extra files to use with these packages for Cyrillic
are in the \textsf{cyrillic} bundle, which at present contains the
following: four font encoding definition files (|t2aenc.def|,
|t2benc.def|, |t2cenc.def|, |x2enc.def|); several input encoding
definition files (all the other |*.def| files), and font definition
files (|*.fd|).
The installation of these is described here.

\subsection{Fonts}

The default font families in \LaTeX{} are the Computer Modern
families, namely the CM~fonts (|OT1|~encoded) and the EC~fonts
(|T1|~encoded).  The LH~fonts, which are now available, provide
Computer Modern fonts for all Cyrillic font encodings.  They are
designed to be compatible with the EC~fonts, and they provide the same
font shapes and sizes; they are available at |CTAN:fonts/cyrillic/lh|
(the latest version is 3.20).  The installation instructions for the
fonts are in the file |INSTALL| in the font distribution.

Other fonts, including Type~1 fonts, can also be used, provided that
their encoding (for \TeX{}) is \mbox{|T2|-compatible}.  Some
ready-to-use packages supporting such fonts are also available, \eg at
\URL{ftp://ftp.vsu.ru/pub/tex} (they should soon be on \ctan).  Currently,
you will find two packages there: \textsf{PsCyr}, which contains some
freely distributable Cyrillic Type~1 fonts with support for \LaTeX{};
and \textsf{c1fonts}, which contains virtual fonts similar to the
\textsf{AE}~fonts package using the BlueSky and BaKoMa fonts
available from \ctan{} (see the |README| file in that package for
detailed information).  Further font packages are expected soon.

\subsection{Hyphenation patterns}

You can find a collection of hyphenation patterns for the Russian
language in the \textsf{ruhyphen} package at
|CTAN:language/hyphenation/ruhyphen|.  These patterns support the
|T2*|~encodings, as well as other popular font encodings used for
Russian typesetting (including the Omega internal encoding).
Patterns for other Cyrillic languages should be adapted to work with
the |T2*|~encodings.

\subsection{\babel{} support for Russian and Ukrainian}
\label{bblrus}

Version~3.6k of \babel{} includes support for the |T2*|~encodings and
for typesetting both Russian and Ukrainian texts using the Cyrillic
letters.  The temporary fontencoding |LWN|, which was used in earlier
releases of \babel{}, will be withdrawn in the near future and replaced
by the |OT2| encoding.

\subsection{Getting pre-built packages}
 
Many of the major \TeX{} distributions, such as te\TeX{}, fp\TeX{} and
\TeX{}live, contain (or soon will) everything that is needed,
including the LH~fonts, \textsf{ruhyphen} and the latest version of
\babel{}.  We hope that all \TeX{} distributions will soon include all
of these, so that the chances are that you will not need to install
this by yourself (but it is not difficult).

If you are using em\TeX, Mik\TeX, or fp\TeX, you
can download the \textsf{ruemtex} package from
\URL{ftp://ftp.vsu.ru/pub/tex}.

\section{Usage}

Support for Cyrillic is based on these standard \LaTeX{} mechanisms:
the \textsf{fontenc} and \textsf{inputenc} packages (and on \babel{}).
Thus the basic principles for its use are similar to those for other
European languages: you simply add, to your document preamble, lines
like the following.

\begin{verbatim}
\usepackage[T2A]{fontenc}
\usepackage[koi8-r]{inputenc}
\end{verbatim}

Here you can put any desired input encoding instead of
\mbox{\texttt{koi8-r}}: for example, it would be \texttt{cp866} if you are
using a MS-DOS text editor with this Cyrillic code page to prepare your
documents, or \texttt{cp1251} if you are a MS~Windows user with Cyrillic
support.  A full list of the available Cyrillic encodings can be found in
Section~\ref{inputencs} and in the file |cyinpenc.dtx|.

Documents are, naturally, not restricted to a single font encoding;
this is essential for multi-lingual journals or documents.  Such
changes can be made by using the |\fontencoding| command as part of a
font-change.  However, it is best to access these font encodings via a
higher-level interface.

Since such changes are often closely related to other
language-dependent settings, it is often sensible to use the \babel{}
system, which provides further useful `localisation' and standardised
multi-lingual interfaces (for further details, see
Section~\ref{bblrus}). Then you can use lines like the following in
your document:

\begin{verbatim}
\usepackage[koi8-r]{inputenc}
\usepackage[russian]{babel}
\end{verbatim}

This will automatically choose the default font encoding for Russian,
which is |T2A|, if available.  Documentation of the complete set of
font-encoding selection rules can be found in |cyrillic.dtx| which is
part of |rusbabel|.

These \LaTeX{} interfaces are very convenient because they make your
documents completely portable, being based solely on standard \TeX{}
features.  This will mean that your documents can be processed on any
\TeX{} system without any need for re-encoding to the `native'
encoding used on each platform; this is because the encoding of the
document is specified in the document itself.

Moreover, if necessary, more than one input encoding can be used
within a document; this could be useful if, for example, you need to
combine articles prepared by authors on different machines.  Each part
of the document is then identified by a |\inputencoding| command,
which can therefore only be used between paragraphs.

Please note that you must always use the two standard \LaTeX{}
commands, |\MakeUppercase| and |\MakeLowercase| to produce uppercase
or lowercase text in your documents.  This is because |\uppercase| and
|\lowercase| will not work at all for Cyrillic (note that these latter
two commands are not, and never have been, available for use directly
in \LaTeX{} documents).


\section{Font encodings for Cyrillic languages}
\label{fontencs}

The Cyrillic font encodings support the following languages.  Note
that some languages can be properly typeset with more than one
encoding.

\begin{itemize}
\raggedright
\item[|T2A|:]
  Abaza, Avar, Agul, Adyghei, Azerbaijani, Altai, Balkar, Bashkir,
  Bulgarian, Buryat, Byelorussian, Gagauz, Dargin, Dungan, Ingush,
  Kabardino-Cherkess, Kazakh, Kalmyk, Karakalpak, Karachaevskii,
  Karelian, Kirghiz, Komi-Zyrian, Komi-Permyak, Kumyk, Lak, Lezghin,
  Macedonian, Mari-Mountain, Mari-Valley, Moldavian, Mongolian,
  Mordvin-Moksha, Mordvin-Erzya, Nogai, Oroch, Osetin, Russian, Rutul,
  Serbian, Tabasaran, Tadzhik, Tatar, Tati, Teleut, Tofalar, Tuva,
  Turkmen, Udmurt, Uzbek, Ukrainian, Hanty-Obskii, Hanty-Surgut,
  Gipsi, Chechen, Chuvash, Crimean-Tatar.
\item[|T2B|:]
  Abaza, Avar, Agul, Adyghei, Aleut, Altai, Balkar, Byelorussian,
  Bulgarian, Buryat, Gagauz, Dargin, Dolgan, Dungan, Ingush, Itelmen,
  Kabardino-Cherkess, Kalmyk, Karakalpak, Karachaevskii, Karelian,
  Ketskii, Kirghiz, Komi-Zyrian, Komi-Permyak, Koryak, Kumyk, Kurdian,
  Lak, Lezghin, Mansi, Mari-Valley, Moldavian, Mongolian,
  Mordvin-Moksha, Mordvin-Erzya, Nanai, Nganasan, Negidal, Nenets,
  Nivh, Nogai, Oroch, Russian, Rutul, Selkup, Tabasaran, Tadzhik,
  Tatar, Tati, Teleut, Tofalar, Tuva, Turkmen, Udyghei, Uigur, Ulch,
  Khakass, Hanty-Vahovskii, Hanty-Kazymskii, Hanty-Obskii,
  Hanty-Surgut, Hanty-Shurysharskii, Gipsi, Chechen, Chukcha, Shor,
  Evenk, Even, Enets, Eskimo, Yukagir, Crimean Tatar, Yakut.
\item[|T2C|:]
  Abkhazian, Bulgarian, Gagauz, Karelian, Komi-Zyrian, Komi-Permyak,
  Kumyk, Mansi, Moldavian, Mordvin-Moksha, Mordvin-Erzya, Nanai,
  Orok (Uilta), Negidal, Nogai, Oroch, Russian, Saam, Old-Bulgarian,
  Old-Russian, Tati, Teleut, Hanty-Obskii, Hanty-Surgut, Evenk,
  Crimean Tatar.
\end{itemize}

The |X2|~encoding was designed to support all the above languages.
Its name does not start with |T| because, for example, it contains no
Latin letters (it is purely a Cyrillic glyph container); it therefore
cannot be used in mixed-script documents along with the other |T*|
encodings.  Please consult Section~6.4 \textit{Naming conventions} of
the file |fntguide.tex| in the base \LaTeX{} distribution for details
of the differences between \LaTeX{} font encodings and how they are
named.

There are two other \LaTeX{} Cyrillic font encodings, |OT2| and |LCY|,
that are not included in the base \LaTeX{} distribution.  The first is
a \mbox{7-bit} encoding (hence the |O|) developed by the AMS; it is
useful for typesetting relatively small fragments of text in Cyrillic,
using a Latin transliteration scheme.  The other, |LCY|, is an
\mbox{8-bit} Cyrillic encoding which is not compatible with the
requirements for \LaTeX{} |T*|~encodings (hence the |L|); thus it is not
suitable for typesetting multi-lingual documents, but it can be used in
Plain \TeX{}-based macro packages because it is an extension of |OT1|.
These two encodings are supported by \babel{} and by \textsf{ot2cyr}.


\section{Input encodings}
\label{inputencs}

Several Cyrillic code-pages are widely used.  Currently, \LaTeX{}
contains support for 20~Cyrillic input encodings (some of which are
variants of each other).

\begin{itemize}

\item |cp855| --- the standard \mbox{MS-DOS} Cyrillic code-page.

\item |cp866| --- the standard \mbox{MS-DOS} Russian code-page.  
  Several code-pages very similar to this are also supported
  (the differences are all in the range 242--254).
  \begin{itemize}
  \item |cp866av| -- the `Cyrillic Alternative' code-page (an
    alternative variant of cp866);
  \item |cp866mav| --  the `Modified Alternative Variant'; 
  \item |cp866nav| --  the `New Alternative Variant'; 
  \item |cp866tat| --  an experimental Tatarian code-page.    
  \end{itemize}

\item |cp1251| --- the standard MS Windows Cyrillic code-page.

\item \mbox{\texttt{koi8-r}} --- a standard Cyrillic code-page widely
  used in UNIX-like systems for Russian language support that is
  specified in RFC~1489.  The situation with \mbox{\texttt{koi8-r}} is
  somewhat similar to that for |cp866|: there are several similar
  code-pages which coincide for all Russian letters but add some other
  Cyrillic letters. The following are supported:
  \begin{itemize}
  \item \mbox{\texttt{koi8-u}}  -- for Ukrainian; 
  \item \mbox{\texttt{koi8-ru}} -- this is described in a draft RFC
    document specifying a widely used character set for mail and news
    exchange in the Ukrainian internet community, as well as for
    presenting WWW information resources in the Ukrainian language;
  \item |isoir111| -- the \mbox{ISO-IR-111 ECMA} Cyrillic Code Page.
  \end{itemize}

\item |iso88595| --- the \mbox{ISO 8859-5} Cyrillic code-page (also called
  \mbox{ISO-IR-144}).
  
\item |maccyr| --- the Apple Macintosh Cyrillic code-page (also known
  as Microsoft cp10007) and |macukr|, the Apple Macintosh Ukrainian
  code-page, very similar to the Cyrillic code-page.

\item The Mongolian code-pages: |ctt| |dbk| |mnk| |mos| |ncc| |mls|.
  These code-pages were taken from Oliver Corff's `Mon\TeX' package
  (available at |CTAN:language/mongolian/montex|).  Since the |T2*|
  encodings support the Mongolian Cyrillic script, it is convenient to
  have support for Mongolian input encodings as well.  Pointers to
  documentation for these code-pages will be much appreciated.

\end{itemize}


\section{Reporting bugs}

In case you find a bug and want to report it, please follow the
guidelines given in the file |bugs.txt| in the base \LaTeX{}
distributions.  Note that there is a category specifically for
reporting any bugs that occur only when using Cyrillic fonts or
support packages.


\section{Miscellanea in the \textsf{T2} bundle}
\label{t2m}

The \textsf{T2}~bundle at |CTAN:macros/latex/contrib/supported/t2|
contains some other useful files, including support for Plain
\TeX{}-based macro packages, support for Bib\TeX{} and MakeIndex (see
also the \textsf{xindy} program and package---highly recommended for
making indices with Cyrillic), support for the \textsf{fontinst}
package, mapping tables relating these Cyrillic font encodings (and
input encodings) to the Unicode character names and slots (these are
in the subdirectory |enc-maps|), and more!

To produce documented source listings of the \textsf{T2}~package, run
\LaTeX{} on the |*.dtx| and |*.fdd| files therein.

When typesetting Cyrillic texts, there is a tradition of using
Cyrillic letters (in some situations) within math formul\ae, in
exactly the same way as most of the world uses Latin letters.
By default this does not work, because symbols declared with
|\DeclareTextSymbol| may not be used in math.

If you need within math to `transparently' typeset glyphs declared in
font encoding definition files, then you could try using the
experimental \textsf{mathtext} package, which is also in the
\textsf{T2}~bundle.  Note that this package uses up at least one
additional math alphabet per font encoding.  For this and other
reasons, The \LaTeX3 Project Team considers that this experimental
extension to \LaTeX{}'s glyph-handling mechanisms should be used with
caution; but please try it out and send us your opinions and ideas.
Note that it is not included in the core of \LaTeX{} because both the
coding and the interfaces are likely to change at some point in the
future.

Finally, here are some pointers to further information:

\begin{quote}
  \URL{http://www.cemi.rssi.ru/cyrtug}\\
  \URL{http://xtalk.price.ru/tex}
\end{quote}

\end{document}

Bell Labs OSI certified Powered by Plan 9

(Return to Plan 9 Home Page)

Copyright © 2021 Plan 9 Foundation. All Rights Reserved.
Comments to [email protected].