.\" Automatically generated by Pod::Man 2.27 (Pod::Simple 3.28)
.\"
.\" Standard preamble:
.\" ========================================================================
.de Sp \" Vertical space (when we can't use .PP)
.if t .sp .5v
.if n .sp
..
.de Vb \" Begin verbatim text
.ft CW
.nf
.ne \\$1
..
.de Ve \" End verbatim text
.ft R
.fi
..
.\" Set up some character translations and predefined strings.  \*(-- will
.\" give an unbreakable dash, \*(PI will give pi, \*(L" will give a left
.\" double quote, and \*(R" will give a right double quote.  \*(C+ will
.\" give a nicer C++.  Capital omega is used to do unbreakable dashes and
.\" therefore won't be available.  \*(C` and \*(C' expand to `' in nroff,
.\" nothing in troff, for use with C<>.
.tr \(*W-
.ds C+ C\v'-.1v'\h'-1p'\s-2+\h'-1p'+\s0\v'.1v'\h'-1p'
.ie n \{\
.    ds -- \(*W-
.    ds PI pi
.    if (\n(.H=4u)&(1m=24u) .ds -- \(*W\h'-12u'\(*W\h'-12u'-\" diablo 10 pitch
.    if (\n(.H=4u)&(1m=20u) .ds -- \(*W\h'-12u'\(*W\h'-8u'-\"  diablo 12 pitch
.    ds L" ""
.    ds R" ""
.    ds C` ""
.    ds C' ""
'br\}
.el\{\
.    ds -- \|\(em\|
.    ds PI \(*p
.    ds L" ``
.    ds R" ''
.    ds C`
.    ds C'
'br\}
.\"
.\" Escape single quotes in literal strings from groff's Unicode transform.
.ie \n(.g .ds Aq \(aq
.el       .ds Aq '
.\"
.\" If the F register is turned on, we'll generate index entries on stderr for
.\" titles (.TH), headers (.SH), subsections (.SS), items (.Ip), and index
.\" entries marked with X<> in POD.  Of course, you'll have to process the
.\" output yourself in some meaningful fashion.
.\"
.\" Avoid warning from groff about undefined register 'F'.
.de IX
..
.nr rF 0
.if \n(.g .if rF .nr rF 1
.if (\n(rF:(\n(.g==0)) \{
.    if \nF \{
.        de IX
.        tm Index:\\$1\t\\n%\t"\\$2"
..
.        if !\nF==2 \{
.            nr % 0
.            nr F 2
.        \}
.    \}
.\}
.rr rF
.\"
.\" Accent mark definitions (@(#)ms.acc 1.5 88/02/08 SMI; from UCB 4.2).
.\" Fear.  Run.  Save yourself.  No user-serviceable parts.
.    \" fudge factors for nroff and troff
.if n \{\
.    ds #H 0
.    ds #V .8m
.    ds #F .3m
.    ds #[ \f1
.    ds #] \fP
.\}
.if t \{\
.    ds #H ((1u-(\\\\n(.fu%2u))*.13m)
.    ds #V .6m
.    ds #F 0
.    ds #[ \&
.    ds #] \&
.\}
.    \" simple accents for nroff and troff
.if n \{\
.    ds ' \&
.    ds ` \&
.    ds ^ \&
.    ds , \&
.    ds ~ ~
.    ds /
.\}
.if t \{\
.    ds ' \\k:\h'-(\\n(.wu*8/10-\*(#H)'\'\h"|\\n:u"
.    ds ` \\k:\h'-(\\n(.wu*8/10-\*(#H)'\`\h'|\\n:u'
.    ds ^ \\k:\h'-(\\n(.wu*10/11-\*(#H)'^\h'|\\n:u'
.    ds , \\k:\h'-(\\n(.wu*8/10)',\h'|\\n:u'
.    ds ~ \\k:\h'-(\\n(.wu-\*(#H-.1m)'~\h'|\\n:u'
.    ds / \\k:\h'-(\\n(.wu*8/10-\*(#H)'\z\(sl\h'|\\n:u'
.\}
.    \" troff and (daisy-wheel) nroff accents
.ds : \\k:\h'-(\\n(.wu*8/10-\*(#H+.1m+\*(#F)'\v'-\*(#V'\z.\h'.2m+\*(#F'.\h'|\\n:u'\v'\*(#V'
.ds 8 \h'\*(#H'\(*b\h'-\*(#H'
.ds o \\k:\h'-(\\n(.wu+\w'\(de'u-\*(#H)/2u'\v'-.3n'\*(#[\z\(de\v'.3n'\h'|\\n:u'\*(#]
.ds d- \h'\*(#H'\(pd\h'-\w'~'u'\v'-.25m'\f2\(hy\fP\v'.25m'\h'-\*(#H'
.ds D- D\\k:\h'-\w'D'u'\v'-.11m'\z\(hy\v'.11m'\h'|\\n:u'
.ds th \*(#[\v'.3m'\s+1I\s-1\v'-.3m'\h'-(\w'I'u*2/3)'\s-1o\s+1\*(#]
.ds Th \*(#[\s+2I\s-2\h'-\w'I'u*3/5'\v'-.3m'o\v'.3m'\*(#]
.ds ae a\h'-(\w'a'u*4/10)'e
.ds Ae A\h'-(\w'A'u*4/10)'E
.    \" corrections for vroff
.if v .ds ~ \\k:\h'-(\\n(.wu*9/10-\*(#H)'\s-2\u~\d\s+2\h'|\\n:u'
.if v .ds ^ \\k:\h'-(\\n(.wu*10/11-\*(#H)'\v'-.4m'^\v'.4m'\h'|\\n:u'
.    \" for low resolution devices (crt and lpr)
.if \n(.H>23 .if \n(.V>19 \
\{\
.    ds : e
.    ds 8 ss
.    ds o a
.    ds d- d\h'-1'\(ga
.    ds D- D\h'-1'\(hy
.    ds th \o'bp'
.    ds Th \o'LP'
.    ds ae ae
.    ds Ae AE
.\}
.rm #[ #] #H #V #F C
.\" ========================================================================
.\"
.IX Title "Encoding::FixLatin 3"
.TH Encoding::FixLatin 3 "2014-05-22" "perl v5.16.3" "User Contributed Perl Documentation"
.\" For nroff, turn off justification.  Always turn off hyphenation; it makes
.\" way too many mistakes in technical documents.
.if n .ad l
.nh
.SH "NAME"
Encoding::FixLatin \- takes mixed encoding input and produces UTF\-8 output
.SH "SYNOPSIS"
.IX Header "SYNOPSIS"
.Vb 1
\&    use Encoding::FixLatin qw(fix_latin);
\&
\&    my $utf8_string = fix_latin($mixed_encoding_string);
.Ve
.SH "DESCRIPTION"
.IX Header "DESCRIPTION"
Most encoding conversion tools take input in one encoding and produce output in
another encoding.  This module takes input which may contain characters in more
than one encoding and makes a best effort to convert them all to \s-1UTF\-8\s0 output.
.SH "EXPORTS"
.IX Header "EXPORTS"
Nothing is exported by default.  The only public function is \f(CW\*(C`fix_latin\*(C'\fR which
will be exported on request (as per \s-1SYNOPSIS\s0).
.SH "FUNCTIONS"
.IX Header "FUNCTIONS"
.SS "fix_latin( string, options ... )"
.IX Subsection "fix_latin( string, options ... )"
Decodes the supplied 'string' and returns a \s-1UTF\-8\s0 version of the string.  The
following rules are used:
.IP "\(bu" 4
\&\s-1ASCII\s0 characters (single bytes in the range 0x00 \- 0x7F) are passed through
unchanged.
.IP "\(bu" 4
Well-formed \s-1UTF\-8\s0 multi-byte characters are also passed through unchanged.
.IP "\(bu" 4
\&\s-1UTF\-8\s0 multi-byte characters which are over-long but otherwise well-formed are
converted to the shortest \s-1UTF\-8\s0 normal form.
.IP "\(bu" 4
Bytes in the range 0xA0 \- 0xFF are assumed to be Latin\-1 characters (\s-1ISO8859\-1\s0
encoded) and are converted to \s-1UTF\-8.\s0
.IP "\(bu" 4
Bytes in the range 0x80 \- 0x9F are assumed to be Win\-Latin\-1 characters (\s-1CP1252\s0
encoded) and are converted to \s-1UTF\-8,\s0 except for the five bytes in this range
which are not defined in \s-1CP1252 \s0(see the \f(CW\*(C`ascii_hex\*(C'\fR option below).
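.PP
As a rough sketch of these rules in action (the input string is made up for
illustration), the following mixes a Latin\-1 byte, a well-formed \s-1UTF\-8\s0
sequence and a \s-1CP1252\s0 byte in a single string:
.PP
.Vb 3
\&    use strict;
\&    use warnings;
\&    use Encoding::FixLatin qw(fix_latin);
\&
\&    # Illustrative input: Latin\-1 e\-acute (0xE9), UTF\-8 e\-acute (0xC3 0xA9)
\&    # and a CP1252 left double quote (0x93), mixed with ASCII.
\&    my $mixed = "caf\exE9 caf\exC3\exA9 \ex93";
\&    my $fixed = fix_latin($mixed);
\&
\&    # Going by the rules above, both e\-acutes should become U+00E9 and the
\&    # quote U+201C.  Encode on output, since $fixed is a character string.
\&    binmode(STDOUT, \*(Aq:encoding(UTF\-8)\*(Aq);
\&    print "$fixed\en";
.Ve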
.PP
The Achilles' heel of these rules is that certain combinations of two
consecutive Latin\-1 characters can be misinterpreted as a single \s-1UTF\-8\s0
character, i.e. there is some risk of data corruption.  See the '\s-1LIMITATIONS\s0'
section below to quantify this risk for the type of data you're working with.
.PP
If you pass in a string that is already a \s-1UTF\-8\s0 character string (the utf8 flag
is set on the Perl scalar) then the string will simply be returned unchanged.
However, if the 'bytes_only' option is specified (see below), the returned
string will be a byte string rather than a character string.  The rules
described above will not be applied in either case.
.PP
The \f(CW\*(C`fix_latin\*(C'\fR function accepts options as name => value pairs.  Recognised
options are:
.IP "bytes_only => 1/0" 4
.IX Item "bytes_only => 1/0"
The value returned by fix_latin is normally a Perl character string and will
have the utf8 flag set if it contains non-ASCII characters.  If you set the
\&\f(CW\*(C`bytes_only\*(C'\fR option to a true value, the returned string will be a binary
string of \s-1UTF\-8\s0 bytes.  The utf8 flag will not be set.  This is useful if
you're going to immediately use the string in an \s-1IO\s0 operation and wish to avoid
the overhead of converting to and from Perl's internal representation.
.IP "ascii_hex => 1/0" 4
.IX Item "ascii_hex => 1/0"
Bytes in the range 0x80\-0x9F are assumed to be \s-1CP1252\s0; however, \s-1CP1252\s0 does not
define a mapping for 5 of these bytes (0x81, 0x8D, 0x8F, 0x90 and 0x9D).  Use
this option to specify how they should be handled:
.RS 4
.IP "\(bu" 4
If the ascii_hex option is set to true (the default), these bytes will be
converted to 3 character \s-1ASCII\s0 hex strings of the form \f(CW%XX\fR.  For example the
byte 0x81 will become \f(CW%81\fR.
.IP "\(bu" 4
If the ascii_hex option is set to false, these bytes will be treated as Latin\-1
control characters and converted to the equivalent \s-1UTF\-8\s0 multi-byte sequences.
.RE
.RS 4
.Sp
When processing text strings you will almost certainly never encounter these
bytes at all.  The most likely reason you would see them is that an
attacker is feeding random bytes to your application.  It is difficult to
conceive of a scenario in which it makes sense to change this option from its
default setting.
.RE
.IP "overlong_fatal => 1/0" 4
.IX Item "overlong_fatal => 1/0"
An over-long \s-1UTF\-8\s0 byte sequence is one which uses more than the minimum number
of bytes required to represent the character.  Use this option to specify how
overlong sequences should be handled.
.RS 4
.IP "\(bu" 4
If the overlong_fatal option is set to false (the default), over-long sequences
will be converted to the shortest normal \s-1UTF\-8\s0 sequence.  For example the input
byte string \*(L"\exC0\exBCscript>\*(R" would be converted to \*(L"<script>\*(R".
.IP "\(bu" 4
If the overlong_fatal option is set to true, this module will die with an
error when an overlong sequence is encountered.  You would probably want to
use eval to trap and handle this scenario.
.RE
.RS 4
.Sp
There is a strong argument that overlong sequences are only ever encountered
in malicious input and therefore they should always be rejected.
.RE
.IP "use_xs => 'auto' | 'always' | 'never'" 4
.IX Item "use_xs => 'auto' | 'always' | 'never'"
This option controls whether or not the \s-1XS \s0(compiled C) implementation of
\&\f(CW\*(C`fix_latin\*(C'\fR is used.  Note, the Encoding::FixLatin::XS module must be
installed separately.  The three possible values for this option are:
.RS 4
.IP "\(bu" 4
\&'auto' is the default behaviour \- if Encoding::FixLatin::XS is installed, it
will be loaded and used, otherwise the pure Perl implementation will be used.
.IP "\(bu" 4
\&'always' means the \s-1XS\s0 module will be used and a fatal exception will be thrown
if it is not available.
.IP "\(bu" 4
\&'never' means no attempt will be made to use the \s-1XS\s0 module.
.RE
.RS 4
.RE
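.PP
The options can be combined in one call.  The snippet below is a minimal
sketch rather than a recommended recipe: the input strings are invented, and
it assumes that \f(CW\*(C`fix_latin\*(C'\fR dies with an ordinary exception that
\&\f(CW\*(C`eval\*(C'\fR can trap, as described under \f(CW\*(C`overlong_fatal\*(C'\fR.
.PP
.Vb 3
\&    use strict;
\&    use warnings;
\&    use Encoding::FixLatin qw(fix_latin);
\&
\&    # Over\-long encoding of "<" followed by ASCII, as in the example above.
\&    my $suspect = "\exC0\exBCscript>";
\&
\&    # Default behaviour: the over\-long sequence is normalised.
\&    my $lenient = fix_latin($suspect);
\&
\&    # Strict behaviour: die on over\-long input and trap the error.
\&    my $strict = eval { fix_latin($suspect, overlong_fatal => 1) };
\&    warn "Rejected input: $@" unless defined $strict;
\&
\&    # UTF\-8 bytes instead of a character string, using pure Perl only.
\&    my $bytes = fix_latin("caf\exE9", bytes_only => 1, use_xs => \*(Aqnever\*(Aq);
\&    printf "%d bytes\en", length $bytes;
.Ve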
.SH "LIMITATIONS OF THIS MODULE"
.IX Header "LIMITATIONS OF THIS MODULE"
This module is perfectly safe when handling data containing only \s-1ASCII\s0 and
\&\s-1UTF\-8\s0 characters.  Introducing \s-1ISO8859\-1\s0 or \s-1CP1252\s0 characters does add a risk
of data corruption (i.e. some characters in the input being converted to
incorrect characters in the output).  To quantify the risk it is necessary to
understand its cause.  First, let's break the input bytes into two categories.
.IP "\(bu" 4
\&\s-1ASCII\s0 bytes fall into the range 0x00\-0x7F \- the most significant bit is always
set to zero.  I'll use the symbol 'a' to represent these bytes.
.IP "\(bu" 4
Non-ASCII bytes fall into the range 0x80\-0xFF \- the most significant bit is
always set to one.  I'll use the symbol 'B' to represent these bytes.
.PP
A sequence of \s-1ASCII\s0 bytes ('aaa') is always unambiguous and will not be
misinterpreted.
.PP
Lone non-ASCII bytes within sequences of \s-1ASCII\s0 bytes ('aaBaBa') are also
unambiguous and will not be misinterpreted.
.PP
The potential for error occurs with two (or more) consecutive non-ASCII bytes.
For example the sequence '\s-1BB\s0' might be intended to represent two characters in
one of the legacy encodings or a single character in \s-1UTF\-8. \s0 Because this
module gives precedence to the \s-1UTF\-8\s0 characters it is possible that a random
pair of legacy characters may be misinterpreted as a single \s-1UTF\-8\s0 character.
.PP
The risk is reduced by the fact that not all pairs of non-ASCII bytes form
valid \s-1UTF\-8\s0 sequences.  Every non-ASCII \s-1UTF\-8\s0 character is made up of two or
more 'B' bytes and no 'a' bytes.  For a two-byte character, the first byte must
be in the range 0xC0\-0xDF and the second must be in the range 0x80\-0xBF.
.PP
Any pair of '\s-1BB\s0' bytes that does not fall into the required ranges is
unambiguous and will not be misinterpreted.
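.PP
To put a number on this, the short script below counts how many of the
possible '\s-1BB\s0' pairs fall inside the two-byte ranges given above.  It is
a self-contained illustration that does not call this module, and the helper
name \f(CW\*(C`looks_like_utf8_pair\*(C'\fR is invented for the example.
.PP
.Vb 2
\&    use strict;
\&    use warnings;
\&
\&    # Hypothetical helper, not part of Encoding::FixLatin: true if two bytes
\&    # fall in the ranges of a two\-byte UTF\-8 sequence (0xC0\-0xDF, 0x80\-0xBF).
\&    sub looks_like_utf8_pair {
\&        my ($first, $second) = @_;
\&        return $first >= 0xC0 && $first <= 0xDF
\&            && $second >= 0x80 && $second <= 0xBF;
\&    }
\&
\&    # Count the ambiguous pairs among all pairs of non\-ASCII bytes.
\&    my ($ambiguous, $total) = (0, 0);
\&    for my $first (0x80 .. 0xFF) {
\&        for my $second (0x80 .. 0xFF) {
\&            $total++;
\&            $ambiguous++ if looks_like_utf8_pair($first, $second);
\&        }
\&    }
\&
\&    # 32 lead bytes x 64 continuation bytes = 2048 of 128 x 128 = 16384 pairs.
\&    printf "%d of %d pairs (%.1f%%)\en", $ambiguous, $total,
\&        100 * $ambiguous / $total;
.Ve
.PP
This treats every byte pair as equally likely, so it says nothing about how
often such pairs occur in real text.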
.PP
Pairs of '\s-1BB\s0' bytes that are actually individual Latin\-1 characters but
happen to fall into the required ranges to be misinterpreted as a \s-1UTF\-8\s0
character are rather unlikely to appear in normal text.  If you look those
ranges up on a Latin\-1 code chart you'll see that the first character would
need to be an uppercase accented letter and the second would need to be a
non-printable control character or a special punctuation symbol.
.PP
One way to summarise the role of this module is that it guarantees to
produce \s-1UTF\-8\s0 output, possibly at the cost of introducing the odd 'typo'.
.SH "BUGS"
.IX Header "BUGS"
Please report any bugs to \f(CW\*(C`bug\-encoding\-fixlatin at rt.cpan.org\*(C'\fR, or through
the web interface at
<http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Encoding\-FixLatin>.  I will be
notified, and then you'll automatically be notified of progress on your bug as
I make changes.
.SH "SUPPORT"
.IX Header "SUPPORT"
You can also look for information at:
.IP "\(bu" 4
Issue tracker
.Sp
<https://github.com/grantm/encoding\-fixlatin/issues>
.IP "\(bu" 4
AnnoCPAN: Annotated \s-1CPAN\s0 documentation
.Sp
<http://annocpan.org/dist/Encoding\-FixLatin>
.IP "\(bu" 4
\&\s-1CPAN\s0 Ratings
.Sp
<http://cpanratings.perl.org/d/Encoding\-FixLatin>
.IP "\(bu" 4
Search \s-1CPAN\s0
.Sp
<http://search.cpan.org/dist/Encoding\-FixLatin/>
.IP "\(bu" 4
Source code repository
.Sp
<http://github.com/grantm/encoding\-fixlatin>
.SH "COPYRIGHT & LICENSE"
.IX Header "COPYRIGHT & LICENSE"
Copyright 2009\-2014 Grant McLean \f(CW\*(C`<grantm@cpan.org>\*(C'\fR
.PP
This program is free software; you can redistribute it and/or modify it
under the same terms as Perl itself.