===========================================================
Collate Order and Character Set - GLOB patterns and accents
===========================================================
-Ian! D. Allen - idallen@idallen.ca - www.idallen.com

This file should help you understand Unix/Linux scripts in a world of
increasing internationalization (i18n).

I used to say that a shell script needed to set only two things to behave
properly no matter what nonsense was set in the parent process, PATH
and umask:

    #!/bin/sh -u
    PATH=/bin:/usr/bin ; export PATH
    umask 022

I've discovered a third and fourth necessity: setting character collation
order, and setting the acceptable input character set.  Without these
additions, scripts may behave differently depending on environment
variables set (or not set) in the parent process.

You will find these variables used in Unix/Linux start-up scripts for
network services.

-------------
Collate Order
-------------

Here is an example of the intuitive, strict numeric (byte value) collation
order that we've all come to expect over the past three decades:

    $ LC_COLLATE=C ; export LC_COLLATE     # collate in strict numeric order
    $ touch a A b B c C x X y Y z Z
    $ ls
    A  B  C  X  Y  Z  a  b  c  x  y  z     # expected sorted output
    $ ls | sort | fmt
    A B C X Y Z a b c x y z
    $ echo [a-z]
    a b c x y z
    $ echo [A-Z]
    A B C X Y Z

Below is the non-intuitive output that appears if you don't set the
character collation order to strict numeric, and you try to use ranges
with dashes in them:

    $ LC_COLLATE=en_US ; export LC_COLLATE   # many Linux distros set this!
    $ ls
    a  A  b  B  c  C  x  X  y  Y  z  Z       # note the new collate order!
    $ ls | sort | fmt
    a A b B c C x X y Y z Z
    $ echo [a-z]
    a A b B c C x X y Y z            # note how 'Z' is outside the range!
    $ echo [A-Z]
    A b B c C x X y Y z Z            # note how 'a' is outside the range!

With many modern Linux locale settings, such as en_US, en_CA, or even
en_CA.utf8, characters are not collated in strict numeric order;
the collating order places upper and lower case together, in this order:

    a A b B c C .... x X y Y z Z

and so the GLOB pattern [a-z] (which we expect to match only lower-case
letters) actually matches everything from 'a' to 'z' in that order:
all the lower-case letters and all but one of the upper-case letters,
a A b B c C .... x X y Y z (and not 'Z')!  Likewise, the GLOB pattern
[A-Z] (which we expect to match only upper-case letters) actually
matches everything from 'A' to 'Z': all the upper-case letters and all
but one of the lower-case letters, A b B c C .... x X y Y z Z (and not
'a')!
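
One way to get strict numeric order for a single command, without
changing the locale of the rest of the script, is to over-ride the locale
for that command only.  The following is a minimal sketch; the function
name sort_bytewise is made up for this illustration, and it assumes an
ordinary POSIX sort command:

```shell
# sort_bytewise is a hypothetical helper name, not a standard command.
# It sorts its standard input in strict numeric (byte value) order by
# setting LC_ALL=C for this one sort command only; the locale of the
# calling shell is left unchanged.
sort_bytewise () {
    LC_ALL=C sort
}

# Upper-case sorts before lower-case here because 'A'..'Z' (65..90)
# have smaller byte values than 'a'..'z' (97..122).
printf '%s\n' b A a B | sort_bytewise
```

The pipeline at the end prints A, B, a, b, one per line, whatever
collating order the parent process selected.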

The environment variables LC_* determine your "locale" and affect how
programs behave:

    LC_ADDRESS=en_US
    LC_COLLATE=C
    LC_CTYPE=en_US
    LC_IDENTIFICATION=en_US
    LC_MEASUREMENT=en_US
    LC_MESSAGES=en_US
    LC_MONETARY=en_US
    LC_NAME=en_US
    LC_NUMERIC=en_US
    LC_PAPER=en_US
    LC_SOURCED=1
    LC_TELEPHONE=en_US
    LC_TIME=en_US

The master variable "LC_ALL" over-rides them all, if set, and the "LANG"
variable supplies the default for any LC_* variable that is not set.
Of particular concern to shell scripts are LC_CTYPE (the type of
characters allowed, e.g. 7-bit ASCII or full 8-bit iso-latin-1 with
accents) and LC_COLLATE (the order of the characters in the alphabet).
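
The order in which these variables are consulted (LC_ALL first, then
the individual LC_* variable, then LANG) can be sketched as a small shell
function.  This is only an illustration: the name effective_locale is
hypothetical, the list of categories it accepts is deliberately short,
and the final fall-back to "C" is an assumption, since the default used
when nothing is set is up to the implementation:

```shell
# effective_locale CATEGORY
# Hypothetical helper: print the locale name that applies to CATEGORY
# (for example LC_COLLATE), following the usual precedence:
#   1. LC_ALL, if set and not empty
#   2. the category's own variable, if set and not empty
#   3. LANG, if set and not empty
#   4. "C" (ASSUMPTION: the real default is implementation-defined)
effective_locale () {
    case "$1" in
        LC_COLLATE|LC_CTYPE|LC_MESSAGES|LC_MONETARY|LC_NUMERIC|LC_TIME) ;;
        *) echo "effective_locale: unknown category: $1" >&2 ; return 2 ;;
    esac
    if [ -n "${LC_ALL:-}" ] ; then
        echo "$LC_ALL"
        return 0
    fi
    # Look up the variable whose name is in $1; safe because $1 was
    # checked against a fixed list above.  Works under "sh -u".
    eval "category_value=\${$1:-}"
    if [ -n "$category_value" ] ; then
        echo "$category_value"
    elif [ -n "${LANG:-}" ] ; then
        echo "$LANG"
    else
        echo "C"
    fi
}
```

With the settings listed above and LC_ALL unset, "effective_locale
LC_COLLATE" prints C and "effective_locale LC_CTYPE" prints en_US.
If the parent had exported LC_ALL=en_US, both would print en_US, which
is why a script that sets only LC_COLLATE can still be over-ridden.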

If you use character ranges containing dashes (e.g. [a-z]), you must
set and export the LC_COLLATE "C" locale at the top of your script,
to make sure your ranges match the characters in strict numeric order:

    #!/bin/sh -u
    PATH=/bin:/usr/bin ; export PATH
    umask 022
    LC_COLLATE=C ; export LC_COLLATE   # collate in strict numeric order
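
A script that depends on dashed ranges can also check, at start-up,
that the ranges really behave in strict numeric order.  The sketch below
is one possible self-test, not a standard idiom; the function name
range_is_strict is invented for this example, and it assumes that the
shell uses the same matching rules for "case" patterns as for GLOB
patterns (some shells ignore the locale for ranges altogether):

```shell
# range_is_strict is a hypothetical helper name for this sketch.
# It succeeds (exit status 0) only if the dashed range [a-z] behaves as
# it does in strict numeric order: it must match 'b' and it must not
# match 'B'.
range_is_strict () {
    case b in
        [a-z]) ;;
        *) return 1 ;;
    esac
    case B in
        [a-z]) return 1 ;;
    esac
    return 0
}
```

A script could call it right after setting LC_COLLATE and refuse to
continue if the check fails:

    range_is_strict || { echo "$0: [a-z] is not strict" >&2 ; exit 1 ; }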

------------------------------------------------
Internationalization and POSIX character classes
------------------------------------------------

In an international (non-English) world where characters include accents,
dashed ranges such as [a-z] and [A-Z] are wrong.  These ranges may not
match accented characters at all, either upper- or lower-case, and they
can mis-handle alphabets with upper-case and lower-case collated together.

If the LC_COLLATE order is set to strict numeric order ("C"), dashed
ranges behave predictably:

    $ unset LC_ALL
    $ LC_CTYPE=en_US ; export LC_CTYPE   # accept iso-latin-1 characters
    $ LC_COLLATE=C ; export LC_COLLATE   # collate in strict numeric order
    $ touch a A b B c C x X y Y z Z
    $ touch                          # four latin-1 accented characters
    $ ls
    A  B  C  X  Y  Z  a  b  c  x  y  z        
    $ ls | sort | fmt
    A B C X Y Z a b c x y z    
    $ echo [a-z]
    a b c x y z
    $ echo [A-Z]
    A B C X Y Z

The above shows that the latin-1 characters sort to the end (they are
high-value 8-bit characters) and are not matched by the GLOB ranges in
a strict numeric order collating sequence such as LC_COLLATE=C.

If we change the collating sequence away from strict numeric "C", the
GLOB ranges match a somewhat non-intuitive set of characters:

    $ unset LC_ALL
    $ LC_CTYPE=en_US ; export LC_CTYPE      # accept iso-latin-1 characters
    $ LC_COLLATE=en_US ; export LC_COLLATE  # collate together
    $ ls
    a  A      b  B  c  C      x  X  y  Y  z  Z
    $ ls | sort | fmt
    a A   b B c C   x X y Y z Z
    $ echo [a-z]
    a A   b B c C   x X y Y z           # note missing 'Z'
    $ echo [A-Z]
    A   b B c C   x X y Y z Z           # note missing 'a'

Instead of using dashed character ranges (which misbehave, as you can see
above), many matching systems let you specify a POSIX standard "class"
of characters to match by name (e.g. "lower" and "upper"), and these *do*
work correctly to match even accented characters:

    $ unset LC_ALL
    $ LC_CTYPE=en_US ; export LC_CTYPE      # accept iso-latin-1 characters
    $ LC_COLLATE=C ; export LC_COLLATE      # collate in strict numeric order
    $ echo [[:lower:]]
    a b c x y z                           # all lower-case, nothing missing
    $ echo [[:upper:]]
    A B C X Y Z                           # all upper-case, nothing missing
    $ LC_COLLATE=en_US ; export LC_COLLATE  # collate together
    $ echo [[:lower:]]
    a  b c  x y z                         # all lower-case, nothing missing
    $ echo [[:upper:]]
    A  B C  X Y Z                         # all upper-case, nothing missing
    $ LC_CTYPE=C ; export LC_CTYPE          # accept only plain ASCII
    $ echo [[:lower:]]
    a b c x y z                             # only lower-case ASCII now
    $ echo [[:upper:]]
    A B C X Y Z                             # only upper-case ASCII now

While the order of the characters in the POSIX class changes with the
collating order, the list of characters matched does not - it is always
the correct list for the given CTYPE locale.  Contrast this with the
dashed [a-z] range used above, where the list of characters matched
changed non-intuitively depending on the collating order selected.

In multi-lingual countries such as Canada, pathnames will often contain
accents.  Your programs need to handle them correctly.  Avoid character
ranges containing dashes, and use the POSIX character classes that aren't
affected by the character collating sequence being used:

    $ rm [a-z]*          # WRONG - dependent on collating order
    $ rm [[:lower:]]*    # RIGHT - use the POSIX class that always works
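
The POSIX classes work in "case" patterns as well as in GLOB patterns,
so a script can classify a name without any dashed ranges.  Here is a
minimal sketch; the function name first_char_class is hypothetical and
the four categories are chosen only for illustration:

```shell
# first_char_class NAME
# Hypothetical helper: report the class of the first character of NAME
# using POSIX character classes instead of dashed ranges, so the answer
# does not depend on the collating order.  Which accented characters
# count as lower-case or upper-case depends on LC_CTYPE.
first_char_class () {
    case "$1" in
        [[:lower:]]*) echo lower ;;
        [[:upper:]]*) echo upper ;;
        [[:digit:]]*) echo digit ;;
        *)            echo other ;;
    esac
}
```

For example, "first_char_class README" prints upper and
"first_char_class notes.txt" prints lower under any collating order.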

To be safe, always start your scripts with a correct setting of LC_COLLATE:

    #!/bin/sh -u
    PATH=/bin:/usr/bin ; export PATH
    umask 022
    LC_COLLATE=C ; export LC_COLLATE   # collate in strict numeric order

-------------
Character Set
-------------

Many non-English languages have characters that don't fit into 8-bit
bytes.  The world has adopted the Unicode standard, with encodings such
as UTF-8 and UTF-16, to allow for multi-byte characters, and many (but
not all) Unix/Linux programs know how to process files with multi-byte
characters.

What happens in a script when a program such as wc (word count) counts
the words and characters in a file?  If the file contains multi-byte
characters, should wc treat the multi-bytes as single characters,
or should wc count each byte as a separate character?  Should wc treat
non-ASCII bytes as word separators, or as parts of multi-byte characters?

Usually, there is no indication of which multi-byte encoding is in use in
a text file - one might find UTF-8 and UTF-16 files in the same directory,
and wc is sure to do the wrong thing with one or the other of the files.

The LC_* and LANG environment variables affect how programs such as wc
interpret "characters" in files.  If they are set to anything other
than the "C" setting, you may find that some programs misbehave when
processing files that appear to have multi-byte characters in them.
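
The difference between bytes and characters can be seen with the two
counting options of wc: -c counts bytes and -m counts characters as
defined by the current locale.  The sketch below wraps each in a small
function; the names count_bytes and count_chars are made up for this
example, and the "tr" is there only because some versions of wc pad
their output with spaces:

```shell
# count_bytes and count_chars are hypothetical helper names.
# Both read standard input and print a bare number.

# Bytes: "wc -c" counts bytes, so the locale makes no difference.
count_bytes () {
    wc -c | tr -d ' '
}

# Characters: "wc -m" counts characters as defined by the LC_CTYPE
# locale in effect, so the answer for non-ASCII input can change with
# the environment inherited from the parent process.
count_chars () {
    wc -m | tr -d ' '
}
```

For example, the word "caf" followed by an e-acute encoded in UTF-8 is
five bytes long, so

    printf 'caf\303\251' | count_bytes

prints 5.  The same input piped to count_chars should give 4 in a UTF-8
locale, where the last two bytes form one character, but 5 where every
byte is treated as a character, as is typical of the "C" locale.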

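Unless you are certain of your character set, your scripts must first
pre-emptively set the locale variables to "C" to prevent undefined or
inconsistent behaviour.  Setting LC_COLLATE and LANG handles a parent
that set only those; because LC_ALL over-rides every other locale
variable, set it as well in case the parent exported it:

    #!/bin/sh -u
    PATH=/bin:/usr/bin ; export PATH
    umask 022
    LC_COLLATE=C ; export LC_COLLATE   # collate in strict numeric order
    LANG=C ; export LANG               # don't process multi-byte chars
    LC_ALL=C ; export LC_ALL           # over-ride any LC_ALL from the parent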

-- 
| Ian! D. Allen  -  idallen@idallen.ca  -  Ottawa, Ontario, Canada
| Home Page: http://idallen.com/   Contact Improv: http://contactimprov.ca/
| College professor (Free/Libre GNU+Linux) at: http://teaching.idallen.com/
| Defend digital freedom:  http://eff.org/  and have fun:  http://fools.ca/
