[lextypes] Holus bolus: a general-purpose datatype library

John Cowan cowan at mercury.ccil.org
Mon Jul 21 09:33:10 BST 2003


(For some reason, I keep wanting to type the address as
"lexhack at xmltype.com".  Go figure.)

This is Draft 0 of Holus Bolus, intended as a complete replacement for
the proposed DSDL part 5 datatypes based on WXS.  It handles essentially
everything WXS does, is more sensibly organized, and has much better
support for datatyping localized data.

The name is based on a remark of Rick Jelliffe's (who has seen this
proposal but not yet commented).  This is intended to complement, not
compete with, Jeni's DTL.  My guess is that it would be straightforward
to implement this in DTL.



0) A few fundamental points:

a) There are 8 basic types: string, number, boolean, time, duration,
octetSequence, URI, QName.

b) Each basic type has a set of parameters.  A non-basic type is
constructed by specifying values for some or all of the parameters
associated with the basic type.  Parameters not specified assume a
default value.  There is a relationship between parameter values called
subordination, separately defined for each parameter.

c) A type T is a subtype of a supertype T' if T and T' have the same
basic type and each parameter of T has a value subordinate to the value
of the corresponding parameter of T'.

d) Every type has a parameter called pattern, which is a regular
expression constraining the lexical representations of the type.
The default value is ".*" (matches everything).  A pattern value V is
subordinate to another pattern value V' if every string that matches V
also matches V'.  The syntax of patterns is FIXME.
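
Points b) and c) can be sketched in Python roughly as follows.  This is
only an illustration under stated assumptions: the dictionary layout and
the names SUBORDINATE, DEFAULTS and is_subtype are hypothetical, only the
two string length parameters are modelled, and pattern subordination
(regular-language inclusion) is left out entirely.

```python
import math

# Hypothetical per-parameter subordination rules: rule(v, w) is True
# when value v is subordinate to value w.
SUBORDINATE = {
    "maxLength": lambda v, w: v <= w,  # smaller is subordinate to larger
    "minLength": lambda v, w: v >= w,  # larger is subordinate to smaller
}

# Defaults assumed for parameters a type does not specify.
DEFAULTS = {"maxLength": math.inf, "minLength": 0}


def is_subtype(t, t_prime):
    """True if t is a subtype of t_prime: same basic type, and every
    parameter value of t subordinate to the corresponding one of t_prime."""
    if t["basic"] != t_prime["basic"]:
        return False
    for name, rule in SUBORDINATE.items():
        v = t["params"].get(name, DEFAULTS[name])
        w = t_prime["params"].get(name, DEFAULTS[name])
        if not rule(v, w):
            return False
    return True
```

With these rules, a string type limited to 10 characters is a subtype of
the unconstrained string type, but not the other way around.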


1) Strings have unconstrained lexical representations, and the value
is the same as the representation.  The parameters of the string basic
type are:

a) maxLength specifies the maximum length of the string in characters.
Default Inf.  Smaller values are subordinate to larger values.

b) minLength specifies the minimum length of the string in characters.
Default 0.  Larger values are subordinate to smaller values.

c) whiteSpace specifies whether strings that differ only in whitespace
have the same value.  If whiteSpace is "preserve", then no.  If whiteSpace
is "replace", then only if the strings are the same when all whitespace
characters are changed to spaces.  If whiteSpace is "collapse", then
only if the strings are the same when all whitespace characters are
changed to spaces, all runs of spaces are changed to a single space,
and any leading and/or trailing space is removed.  If whiteSpace is
"remove", then yes.  Subordination is FIXME.

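The four whiteSpace settings in 1c) can be read as four normalizations,
with two strings having the same value when their normalized forms are
equal.  A minimal Python sketch of that reading; the function names are
hypothetical, and treating whatever Python's \s matches as "whitespace"
is an assumption, since the draft does not say which characters count.

```python
import re


def normalize(s, white_space):
    """Reduce s to the form that is compared for value equality."""
    if white_space == "preserve":
        return s
    if white_space == "replace":
        # Every whitespace character becomes a single space.
        return re.sub(r"\s", " ", s)
    if white_space == "collapse":
        # Replace, then squeeze runs of spaces and trim both ends.
        replaced = re.sub(r"\s", " ", s)
        return re.sub(r" +", " ", replaced).strip(" ")
    if white_space == "remove":
        # Whitespace never matters, so drop it all.
        return re.sub(r"\s", "", s)
    raise ValueError("unknown whiteSpace value: %r" % (white_space,))


def same_value(a, b, white_space):
    """True if a and b denote the same string value."""
    return normalize(a, white_space) == normalize(b, white_space)
```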

2) Numbers have lexical representations involving an optional sign,
digits, an optional decimal point, an optional exponent specifier, and
an optional slash representing a fraction.  In addition, "Inf", "+Inf",
"-Inf", and "NaN", case-insensitive, are valid lexical representations
of numbers.  The values are the representable approximations to real
numbers, plus +Inf, -Inf, and NaN.  Whitespace is ignored.  The parameters
of the number basic type are as follows:

a) minimum specifies the minimum value, using an integer whose base and
precision are those specified for the type.  Default -Inf.  Larger values
are subordinate to smaller ones.

b) minimumKind is "inclusive" if the minimum value is included in the
type, or "exclusive" if not.  Default inclusive.  Exclusive is subordinate
to inclusive.

c) maximum specifies the maximum value, using an integer whose base and
precision are those specified for the type.  Default +Inf.  Smaller values
are subordinate to larger ones.

d) maximumKind is "inclusive" if the maximum value is included in the
type, or "exclusive" if not.  Default inclusive.  Exclusive is subordinate
to inclusive.

e) precision specifies the precision within which two representations are
value-equivalent.  Specifically, if R and R' are two representations which
(considered as exact values) differ by no more than 10^-P where P is the
precision, then they are the same value.  Default Inf.  Smaller precisions
are subordinate to larger ones.

f) FIXME need something here to distinguish between fixed-point and
floating-point precision rules.  The one given in e) is fixed-point only.

g) base specifies the base: legal values are 2, 8, 10, 16.
No subordination.

h) allowNaN is true if NaN is an allowed value and false if not.
False is subordinate to true.

i) decimalPoint is "." or ",".  Default ".".  If the value is ".",
then commas are ignored.  This is a lexical parameter.  No subordination.
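
The bounds and the fixed-point precision rule in e) might be sketched
in Python like this.  The function names are hypothetical, None stands
in for the unbounded defaults, base 10 is assumed, and only a finite
precision P is handled.  Python's Fraction happens to accept decimal,
exponent and slash forms, which is convenient here but is not claimed
to match the draft's lexical rules exactly.

```python
from fractions import Fraction


def in_range(x, minimum=None, minimum_kind="inclusive",
             maximum=None, maximum_kind="inclusive"):
    """True if x satisfies the bounds; None means no bound."""
    if minimum is not None:
        if x < minimum or (minimum_kind == "exclusive" and x == minimum):
            return False
    if maximum is not None:
        if x > maximum or (maximum_kind == "exclusive" and x == maximum):
            return False
    return True


def same_value(r, r_prime, precision):
    """Fixed-point rule: two representations are the same value if,
    taken as exact values, they differ by no more than 10^-precision."""
    difference = abs(Fraction(r) - Fraction(r_prime))
    return difference <= Fraction(1, 10 ** precision)
```

For example, with precision 3 the representations "1.234" and "1.2341"
are the same value, while "1.2" and "1.3" are not.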


3) Boolean has two values, true and false.  Whitespace is ignored.  The
facets are as follows:

a) truereps is a set of tokens representing the possible representations
of the true value.  Default {"1", "true"}.  Subordination is subset.

b) falsereps is a set of tokens representing the possible representations
of the false value.  Default {"0", "false"}.  Subordination is subset.
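
A small Python sketch of how truereps and falsereps could drive parsing,
with subset as the subordination test.  The function names are
hypothetical, and the draft does not say what happens when a token is
in neither set; raising an error is an assumption.

```python
def parse_boolean(text,
                  truereps=frozenset({"1", "true"}),
                  falsereps=frozenset({"0", "false"})):
    """Map a lexical representation to True or False."""
    token = "".join(text.split())  # whitespace is ignored
    if token in truereps:
        return True
    if token in falsereps:
        return False
    raise ValueError("not a boolean representation: %r" % (text,))


def reps_subordinate(v, v_prime):
    """Subordination is subset: v is subordinate to v_prime."""
    return set(v) <= set(v_prime)
```

A localized type would simply supply its own sets, for instance
truereps={"oui"} and falsereps={"non"}.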


4) Time represents intervals of time, possibly periodic.  Representations
are digit or letter sequences separated by non-digit-letter sequences,
except that T is a separator, and "." in a seconds value is not a
separator.  Whitespace is ignored.  The facets of the time basic type
are as follows:

a-d) min, max, minKind, maxKind are just like number's facets.

e) components uses the letters Y, M, D, h, m, s, z to specify the order
of components in a time: year, month, day, hour, minute, second, timezone.
This is a lexical parameter.  Component value V is subordinate to value V'
if it is an anchored substring of V' (beginMatch property).

f) monthNames is a sequence of 12 tokens representing the names of
the months, case insensitive, or an empty sequence, in which case the
months have no names.  Default empty.  This is a lexical parameter.
No subordination.

g) defaultTimezone specifies the default timezone as +hh:mm or Z.
Default Z.  No subordination.

h,i) period specifies the repetition period of this time; periodUnit
specifies the unit of the repetition period, either "seconds" or
"gMonths".  Leap seconds are ignored in figuring repetition periods.
No subordination.

j,k) duration specifies the duration of this time; durationUnit specifies
the unit of the duration, either "seconds" or "gMonths".  Leap seconds
are ignored in figuring durations.  No subordination.
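
The subordination rule for components in 4e) can be written down in a
line of Python, assuming that "anchored substring" means a substring
anchored at the beginning, i.e. a prefix.  That reading and the function
name are assumptions, not something the draft states outright.

```python
def components_subordinate(v, v_prime):
    """True if components value v is a leading substring of v_prime."""
    return v_prime.startswith(v)
```

Under this reading "YMD" is subordinate to "YMDhms", but "hms" is not.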


5) Duration uses ISO 8601 lexical representation.  Whitespace is ignored.
The parameters of the duration basic type are:

a-d) minimum, minimumKind, maximum, maximumKind are analogous to those
of number and time.

e,f) precision and precisionUnit specify the precision and the unit of
precision (either "seconds" or "gMonths") of the duration.  No subordination.


6) The octetSequence type represents a sequence of octets.
Whitespace is ignored.  The parameters of the octetSequence type are:

a,b) minLength and maxLength are analogous to the parameters of the
string type, but measured in octets, not characters.

c) representation is either "base64" or "hex" and specifies the lexical
representation.  No subordination.
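
Decoding under the representation parameter, with lengths counted in
octets, might look like this in Python.  The function names are
hypothetical, and None is used here to stand for an unbounded maxLength.

```python
import base64


def decode_octets(text, representation):
    """Turn a lexical representation into a bytes value."""
    compact = "".join(text.split())  # whitespace is ignored
    if representation == "hex":
        return bytes.fromhex(compact)
    if representation == "base64":
        return base64.b64decode(compact, validate=True)
    raise ValueError("unknown representation: %r" % (representation,))


def length_ok(octets, min_length=0, max_length=None):
    """Check minLength/maxLength, measured in octets."""
    if len(octets) < min_length:
        return False
    return max_length is None or len(octets) <= max_length
```

The same five octets can thus be written as "48 65 6c 6c 6f" in hex or
as "SGVsbG8=" in base64.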


7) The URI type represents a generalized URI.  Lexical representations
containing non-URI characters are equivalent to those with UTF-8 based
%xx escapes.  There are no parameters of this type.
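
One way to test this equivalence is to escape both representations and
compare the results.  A Python sketch using urllib.parse.quote; the
names URI_CHARS, escape_uri and same_uri are hypothetical, the set of
characters left alone is an assumption based on the usual reserved and
unreserved URI characters, and case differences inside existing %xx
escapes are not normalized.

```python
from urllib.parse import quote

# Characters left untouched (quote never escapes ASCII letters and
# digits): URI punctuation plus "%", so existing escapes are not
# escaped a second time.
URI_CHARS = ":/?#[]@!$&'()*+,;=%-._~"


def escape_uri(text):
    """Replace non-URI characters with UTF-8 based %xx escapes."""
    return quote(text, safe=URI_CHARS)


def same_uri(a, b):
    """True if a and b are equivalent once escaped."""
    return escape_uri(a) == escape_uri(b)
```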


8) The QName type has the syntax of a QName.  Whitespace is collapsed.
Prefixes are equivalent in value if the namespaces they are bound to
are identical.  There are no parameters of this type.
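
Value comparison of QNames then amounts to comparing (namespace, local
name) pairs.  A Python sketch, assuming the in-scope prefix bindings are
available as a dictionary; the function name is hypothetical, and
looking up unprefixed names under the empty-string key is an assumption.

```python
def resolve_qname(text, bindings):
    """Return the (namespace, local name) pair a QName stands for.

    bindings maps prefixes to namespace names.
    """
    token = " ".join(text.split())  # whitespace is collapsed
    prefix, colon, local = token.partition(":")
    if not colon:
        # No prefix: the whole token is the local name.
        prefix, local = "", token
    return (bindings.get(prefix), local)
```

Two QNames with different prefixes, such as "a:item" and "b:item", come
out equal whenever a and b are bound to the same namespace.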


###########

-- 
John Cowan       http://www.ccil.org/~cowan        <jcowan at reutershealth.com>
        You tollerday donsk?  N.  You tolkatiff scowegian?  Nn.
        You spigotty anglease?  Nnn.  You phonio saxo?  Nnnn.
                Clear all so!  `Tis a Jute.... (Finnegans Wake 16.5)


