Sorting and Collation Explained
Because of specific of Vista products and in according to my
observations, collation is one of most tricky questions I've met in solving
different customer's problems. I want to summarize some points in order to
anser a questions on the NGs, and decided to spend time and summarize a clear
explanation.
Collation is the way to place different symbols and/or symbol
groups in some unique order allowing you to apply the definite comparative rule
'less than or equal to' between words (words in general, not individual chars)
in order to set or predict the position for an old or new word in that ordered
set. Basically, the order of sorting words.
Historically the comparing of binary ASCII values was the
original way to set the comparative rule for a symbols. In DOS world it was
enough to compare char values in order to decide where it have been placed in
datasheet. Words comparing is made symbol by symbol after that.
It is used widely in Clipper and in early versions of Fox. And
this old practice forced many Windows programmers to save non-linguistic
context in Character fields. Latest Clipper International releases allowed the
ability to use different European languages and set the way to sort symbols
linguistically (== alphabetically) rather than their ASCII values. So, the
comparative rule 'less than or equal to' became grounded on the char's index in
the so called _collation_table_.
As for now, symbols are compared not by their value, but by
their indexes in the _collation_table_. Usually the reality brings more
complicated problems. You can see the CollationExample with Spanish words
sorted rather than individual letters. The same is true for a number of other
languages. An additional problem with collation is that indexing on individual
letters is not enough -- entire words must be taken into context in order to be
sorted. There are even different approaches to sort symbol groups. Duden and
ETEC supported by Apollo are examples of different group sorting rules. It
expands additionally the _collation_table_ term.
Another problem is Characters from the upper half of the ASCII
table can have different values in DOS and Win32. So the collation_tables_ is
different for DOS and Windows.
All of these listed aspects of collation have generated a new
level for the problem: the problem of compatibility between applications. Or
it's more correct to say, the problem of sharing indexes between applications.
In xBase, the comparative rule is applied not to data directly, but to the key
set, determined after an xBase index expression is calculated for records. The
key is saved in the index tree. The index tree structure is organized according
to the same 'less than or equal to' rule. Because the index tree data structure
depends on the comparative rule, it means this rule must be:
- a) the same for all applications which are using the index
simultaneously
- b) the same for index creation time moment and for a latest
moments of seek /delete/append operations in index tree
This is the main point for collation. We need to use same
collation in applications in order to share indexes! By the way, most DBMS'
working in client/server mode do not have to worry about collation ever. This
collation issue is purely a "file-server" mode specific problem. With
Apollo/SDE version 6.x, you can create your own collation table in order to
build 'less than or equal to' rules used by SDE for both DOS and Windows
charsets. As I mentioned above, you can use ETEC and Duden in order to compare
word groups. It leads to a problem of identification of collation rules used in
indexes. The main one, what to do if an index was created in a one application
(or even apllication version) and then used by another application.
Actually the issue is caused by a lack of DBF specifications,
which suggests nothing about collation in indexes. In according to MSDN the
official FoxPro table specification sets a byte 'Code page mark' in DBF header,
but unfortunately it says nothing about the collation rules used. Additionally,
it seems there are another two bytes reserved in order to mark collation but
this is my guess and cannot be confirmed. An official specification of the Fox
format referred to these bytes as "reserved" and asked Visual FoxPro users to
place respective operators into a config file.
Both Clipper and the latest version of Visual FoxPro use their
own "suggestions" about collation in indexes. Seems Clipper did nothing at all
in order to write external reference for information about used collation. It
was assumed the use of the International Edition of Clipper has different
libraries for different languages. Apollo uses the same core engine SDE DLLs
for all languages, therefore having hundreds of language-oriented DLLs is
clearly not the solution.
The Apollo database engine orients collation onto the current
locale, which is set by Windows. Apollo users must think about collation
synchronization manually, if their application shares indexes on the net or
with a DOS application.
|