[lextypes] Reconciling DTLL and Holus Bolus parameters

Tue Jul 29 18:49:06 BST 2003

(To avoid having to distinguish between the two kinds of parameters, I
will refer to the Holus Bolus parameters (and all similar sets of intrinsic,
as opposed to user-defined, parameters) as properties, reserving the word
parameters for its usage in RELAX NG datatypes and DTLL.)

Jeni wrote:

> I think that it would be possible to alter DTLL to support all these
> different kinds of parameters, by creating a more general
> parameterisation mechanism used to set the values of variables used in
> XPaths etc. But I wonder whether it would be worthwhile placing
> restrictions on what's parameterisable?

I doubt it. There are more parameters in heaven and earth, Horatio... But
there is a middle ground and DTLL is already standing on it. Some of John's
properties, like pattern and whiteSpace, are already given special treatment
in DTLL. It would be possible to incorporate them all in this way, leaving
the parameter space untouched.

To be perfectly clear, what I am proposing is to separate the notions of
parameters, which are user-specified, and properties, which are built-in. I
don't suggest this merely as a compromise. DTLL would be enriched if it were
possible for developers to specify that particular parameters determine the
values of these intrinsic properties. In fact, I will argue that DTLL must
be extended along these lines, to support its own logic.

In DTLL, subtypes are defined based on validity. T1<=T2 (T1 is a subtype of
T2) iff for all values v, valid(v,T1) -> valid(v,T2). (-> is implication)

(I believe this is equivalent to the definition John gave in his recent
"lattice" proposal and, as I wrote earlier, it is equivalent to the set-based
interpretation: Let L1, L2 be the languages (sets of strings) defined by
types T1 and T2, respectively. Then T1<=T2 iff L1[=L2, where [= is the
subset operator. This is obviously true when the languages are defined
exclusively in terms of the valid() function. Given this, it is trivial to
define a type lattice in the usual way, using union and intersection as the
lattice operators, with the set of all strings and the empty set as the top
and bottom, respectively.)
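
A minimal sketch of the set-based reading, in Python, over a deliberately
tiny universe of strings. The names UNIVERSE, is_subtype, join and meet are
my own illustrations, not anything defined by DTLL, and the top element is
taken here to be the set of all strings in the universe.

```python
from itertools import product

# Hypothetical finite universe: all strings over {"a", "b"} of length 0..2.
ALPHABET = "ab"
UNIVERSE = frozenset(
    "".join(chars)
    for n in range(3)
    for chars in product(ALPHABET, repeat=n)
)

def is_subtype(l1, l2):
    """T1 <= T2 iff L1 is a subset of L2."""
    return l1 <= l2

def join(l1, l2):
    """Least upper bound of two languages: their union."""
    return l1 | l2

def meet(l1, l2):
    """Greatest lower bound of two languages: their intersection."""
    return l1 & l2

TOP = UNIVERSE          # every string in the universe
BOTTOM = frozenset()    # no strings at all

L1 = frozenset({"a", "ab"})
L2 = frozenset({"a", "ab", "b"})
```

With these definitions, is_subtype(L1, L2) holds exactly when meet(L1, L2)
gives back L1, which is the usual lattice characterisation of the order.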

Unfortunately, DTLL allows subtype relations to be defined that are nonsense
and for which none of these properties hold. For a simple example, let T4 be
the set of strings that are exactly four characters long and T2 the set that
are exactly two characters long. It is easy in DTLL to state that T2<=T4,
but it can never be so. In fact, T2 and T4 are completely unrelated because
there are no strings in T2 that are also in T4, or vice versa.
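
The T2/T4 claim can be refuted mechanically by searching for a value that is
valid for the claimed subtype but not for the supertype. The following is
only a sketch: valid_t2, valid_t4 and find_counterexample are hypothetical
names, and the candidate set is restricted to a two-letter alphabet so the
search stays small.

```python
from itertools import product

def valid_t2(s):
    """Valid iff the string is exactly two characters long."""
    return len(s) == 2

def valid_t4(s):
    """Valid iff the string is exactly four characters long."""
    return len(s) == 4

def find_counterexample(valid_sub, valid_super, candidates):
    """Return a value valid for the claimed subtype but not the supertype,
    or None if no candidate refutes the claim."""
    for v in candidates:
        if valid_sub(v) and not valid_super(v):
            return v
    return None

# Assumed candidate set: all strings over {"a", "b"} of length 0..4.
candidates = ["".join(c) for n in range(5) for c in product("ab", repeat=n)]

# A non-None witness shows that the assertion T2<=T4 is false.
witness = find_counterexample(valid_t2, valid_t4, candidates)

# No candidate is valid for both types, so the two languages are disjoint.
overlap = [v for v in candidates if valid_t2(v) and valid_t4(v)]
```

A bounded search like this can only refute a subtype assertion, never prove
one, which is part of why the general problem cannot be checked this way.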

It isn't possible to mechanically check all possible subtype assertions, but
the inability to check this one is a serious problem. Moreover, there is
ample precedent showing that this is a mistake programmers are likely to make.
Suppose, in addition, the characters of T2 and T4 are constrained to be only
the digits 0-9. Since 0-99 is a subrange of 0-9999 numerically, the
programmer can easily be led to believe that this implies T2<=T4. But the
order property for T4 does not hold for the subrange; in fact, values of the
subrange are unordered with respect to the base type. This was the Y2K
problem.
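
One way to read the digit example, sketched in Python. The validity
functions and the choice of the years 1999 and 2000 are my own illustration;
the point is only that numeric containment says nothing about the strings,
and that truncating four-digit values to two digits does not preserve their
order.

```python
DIGITS = "0123456789"

def valid_t2(s):
    """Exactly two characters, each a digit 0-9."""
    return len(s) == 2 and all(c in DIGITS for c in s)

def valid_t4(s):
    """Exactly four characters, each a digit 0-9."""
    return len(s) == 4 and all(c in DIGITS for c in s)

# Numerically, 99 lies inside 0-9999 ...
numeric_subrange = 0 <= int("99") <= 9999
# ... but the string "99" is not a valid value of T4.
lexical_subtype = valid_t4("99")

# Four-digit years and their two-digit truncations.
years4 = ["1999", "2000"]
years2 = [y[2:] for y in years4]

order4 = sorted(years4)   # order among the four-digit values
order2 = sorted(years2)   # order among the truncated values

# True when truncation has reversed or otherwise changed the order.
order_changed = [y[2:] for y in order4] != order2
```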

There are an infinite number of possible type parameters and not all of them
are amenable to checking. (An example is easy and instructive. Suppose a
type allows strings that are valid XPath expressions and the parameter "fun"
accepts as value the definition of an XPath function that can be used in an
expression of the type.) So which parameters deserve promotion to property
status?

The WXS datatype facets are a pretty good start (the problem is what they
made of them; most of the facets are fine), augmented by John's list and the
demands of logic. These arise from mathematical necessity and common use
cases. Of course, if the definitions need to be changed, the names should
be changed as well.

I'm not equally enthusiastic about all facets. "length", for example, is
just syntactic sugar. It should be possible/easy to define a length
parameter that sets the minLength and maxLength properties to the same
value, but it is not needed as an intrinsic property. I think I would prefer
to use the WXS facet names as property names if they have the same
definition, but I'm not aware of all the reasons John made the choices he
did. (I just read his "keep the properties independent" criterion as I was
preparing to send this. Makes sense to me.)
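
To illustrate why "length" need not be intrinsic, here is a rough Python
sketch in which only minLength and maxLength are properties and length is a
user-level parameter defined on top of them. StringProperties and
with_length are hypothetical names; nothing here is DTLL or WXS syntax.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class StringProperties:
    """Assumed intrinsic properties: only the two length bounds."""
    min_length: int = 0
    max_length: Optional[int] = None  # None means unbounded

    def valid(self, value: str) -> bool:
        if len(value) < self.min_length:
            return False
        return self.max_length is None or len(value) <= self.max_length

def with_length(n: int) -> StringProperties:
    """Hypothetical 'length' parameter: sets both properties to n."""
    return StringProperties(min_length=n, max_length=n)

four = with_length(4)
```

Here four.valid("abcd") is true while three- and five-character strings are
rejected, with no separate length property anywhere in the model.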

Also, it should be possible for the developer to declare values for all
facets, which is not the case in WXS, where some are merely an explanatory
gloss for the set of built-in types. For example, it must be possible to
declare that a type is unordered and/or incomparable.

I am sure that if agreement in principle is reached about the need for
additional "properties", it will be possible to prioritize the candidates.
So I'll stop here and wait for comments.

(I know that many of you are headed for Montreal, but doggone it, I'm stuck
here and would like to hear your reactions. ;-)

Bob Foster