[phpxmlrpc] xmlrpc_encode_entitites causing parse error
a.h.s. boy (lists)
spudlists at nothingness.org
Thu Nov 17 06:19:44 GMT 2005
I grabbed a copy from CVS, but I'm in the middle of a few days of
hardcore iCalendar coding, so I'm focusing on that. I'll run some
tests and offer comments as soon as I have the chance. Thanks for the
On Nov 16, 2005, at 11:33 AM, Gaetano Giunta wrote:
> OK, code checked into CVS. Feel free to download and test it (I
> added a new test case for UTF-8 in testsuite, but the more testing
> the better).
> I adopted the 'convert all to ASCII' way-of-life, and modified the
> function xmlrpc_encode_entities() to respect the value of $GLOBALS
> As stated in my last post, more flexible usage patterns might make
> it into future releases.
> Right now escaping iso-8859-1 might be faster than it was
> previously, since I use str_replace instead of the hand-made
> algorithm, but escaping UTF8 will be dog slow.
> The lib is not built for speed anyway, if you're aiming for that
> the php xmlrpc extension will surely serve you better.
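For readers following along, the "convert all to ASCII" approach for UTF-8 input can be sketched roughly as below. This is only an illustration, not the code checked into CVS: the function name sketch_utf8_to_ascii_entities() is made up, and it assumes PHP 5.3 or later (for the closure) and valid UTF-8 input.

```php
<?php
// Hypothetical helper for illustration; NOT the library's xmlrpc_encode_entities().
function sketch_utf8_to_ascii_entities($utf8)
{
    // Escape XML special characters first ('&' must be replaced before the others).
    $escaped = str_replace(
        array('&', '<', '>', '"', "'"),
        array('&amp;', '&lt;', '&gt;', '&quot;', '&apos;'),
        $utf8
    );

    // Replace each non-ASCII character with a decimal character reference.
    // The /u modifier makes PCRE treat the subject as UTF-8, so one match is
    // one whole character; the call returns null if the input is not valid UTF-8.
    return preg_replace_callback('/[^\x00-\x7F]/u', function ($m) {
        $b = array_values(unpack('C*', $m[0]));
        switch (count($b)) {
            case 2:
                $code = (($b[0] & 0x1F) << 6) | ($b[1] & 0x3F);
                break;
            case 3:
                $code = (($b[0] & 0x0F) << 12) | (($b[1] & 0x3F) << 6) | ($b[2] & 0x3F);
                break;
            default: // 4-byte sequence
                $code = (($b[0] & 0x07) << 18) | (($b[1] & 0x3F) << 12)
                      | (($b[2] & 0x3F) << 6) | ($b[3] & 0x3F);
        }
        return '&#' . $code . ';';
    }, $escaped);
}

echo sketch_utf8_to_ascii_entities("caf\xC3\xA9 & \xE2\x82\xAC5"), "\n";
// prints: caf&#233; &amp; &#8364;5
```

The output is pure 7-bit ASCII, so it parses the same way whatever charset the receiver assumes, at the cost of a regex callback per non-ASCII character.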
> The main problems I see with that right now are:
> - turning xmlrpc_encode_entities() into a general charset
> transcoder might make it slower for the default case operation,
> unless user has mbstring ON
> - how server and msg objs will communicate to xmlrpcval objs the
> desired charset for serialization (only solution I can think of:
> add an extra param in calls to serialize())
> - xmlrpc_encode_entities() is used when serializing server-added
> debug info. Since that info might come at the same time from user
> messages, client request (at debug lvl 3) and php error messages,
> there is a serious risk it will be a charset pot-pourri, ie there
> is no sure way that it will conform to ANY charset.
> I wonder if using a CDATA section instead of a comment to wrap
> debug info might help in solving this problem.
> The second solution is to just base64-encode the debug info, and
> let the client sort it out.
> Of course that would break any existing client that makes use of
> that undocumented info...
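The base64 idea for debug info could look something like the sketch below. Both function names and the "DEBUG INFO (base64)" marker are invented for this example; the library does not define them. The point is only that base64 output is plain ASCII and never contains "--", so a pot-pourri of charsets can travel safely inside an XML comment.

```php
<?php
// Hypothetical helpers; the comment marker below is made up for this sketch.
function sketch_wrap_debug_info($debug)
{
    // The base64 alphabet is plain ASCII and never contains "--",
    // so the result is always safe inside an XML comment.
    return "<!-- DEBUG INFO (base64):\n" . base64_encode($debug) . "\n-->";
}

function sketch_unwrap_debug_info($xml)
{
    if (preg_match('/<!-- DEBUG INFO \(base64\):\s*([A-Za-z0-9+\/=]+)\s*-->/', $xml, $m)) {
        return base64_decode($m[1]);
    }
    return null; // no debug block found
}

// Mixed ISO-8859-1 and UTF-8 bytes, plus characters that would break a comment.
$mixed = "caf\xE9 / caf\xC3\xA9 -- <b>";
$wrapped = sketch_wrap_debug_info($mixed);
var_dump(sketch_unwrap_debug_info($wrapped) === $mixed); // bool(true)
```

The client gets the original bytes back untouched and still has to guess their charset, which is the "let the client sort it out" part.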
>> -----Original Message-----
>> From: a.h.s. boy (lists) [mailto:spudlists at nothingness.org]
>> Sent: Tuesday, November 15, 2005 6:57 PM
>> To: Gaetano Giunta
>> Cc: phpxmlrpc at lists.usefulinc.com
>> Subject: Re: [phpxmlrpc] xmlrpc_encode_entitites causing parse error
>> On Nov 15, 2005, at 11:31 AM, Gaetano Giunta wrote:
>>> Very toughtful response.
>> Man, I love cross-linguistic typos...makes great new English words:
>> "toughtful" = "tough thoughtfulness". Brilliant.
>>> UTF-8 everywhere is fine and dandy but for 2 aspects:
>>> - in fact XML-over-http without a charset declaration SHOULD be
>>> assumed to be ISO-8859-1 (there is a RFC somewhere about that,
>>> which I cannot recall now).
>> Hmmm. The XML 1.0 spec (http://www.w3.org/TR/2000/REC-xml-20001006)
>> Because each XML entity not accompanied by external encoding
>> information and not in UTF-8 or UTF-16 encoding MUST begin with an
>> XML encoding declaration, in which the first characters must be
>> '<?xml', any conforming processor can detect, after two to four octets
>> of input, which of the following cases apply.
>> RFC 2376, however, offers suggestions for XML MIME-types sent over
>> HTTP, but it reads (pardon the length):
>> Although listed as an optional parameter, the use of the charset
>> parameter is STRONGLY RECOMMENDED, since this information can be
>> used by XML processors to determine authoritatively the character
>> encoding of the XML entity. The charset parameter can also be used
>> to provide protocol-specific operations, such as charset-based
>> content negotiation in HTTP. "UTF-8" [RFC-2279] is the
>> recommended value, representing the UTF-8 charset. UTF-8 is
>> supported by all conforming XML processors [REC-XML].
>> If the XML entity is transmitted via HTTP, which uses a MIME-like
>> mechanism that is exempt from the restrictions on the text
>> top-level type (see section 19.4.1 of HTTP 1.1 [RFC-2068]), "UTF-16"
>> (Appendix C.3 of [UNICODE] and Amendment 1 of [ISO-10646]) is
>> recommended. UTF-16 is supported by all conforming XML processors
>> [REC-XML]. Since the handling of CR, LF and NUL for text types in
>> most MIME applications would cause undesired transformations of
>> individual octets in UTF-16 multi-octet characters, gateways from
>> HTTP to these MIME applications MUST transform the XML entity from
>> a text/xml; charset="utf-16" to application/xml; charset="utf-16".
>> Conformant with [RFC-2046], if a text/xml entity is received with
>> the charset parameter omitted, MIME processors and XML processors
>> MUST use the default charset value of "us-ascii". In cases where
>> the XML entity is transmitted via HTTP, the default charset value
>> is still "us-ascii".
>> ...which implies that us-ascii, not iso-8859-1, is the default (but
>> not really a problem if you're encoding everything outside of
>> But I know that my RDFParser class, for example, defaults to "utf-8"
>> and overrides that only if the encoding is specified as something
>> else in the xml declaration. I assume I made that decision for good
>> reasons, though I don't remember them now!
>> Still, the factors affecting encoding and transmission are
>> unbelievably complex. In my software, for example, there are:
>> 1) Page encoding used when users submit data via a form (mine: UTF-8)
>> a) Default charset header sent by Apache (mine: UTF-8)
>> b) Default charset set in META tags (mine: UTF-8)
>> c) Charset setting of client browser (no control!)
>> 2) Encoding of database (mine: MySQL 3.x, so limited to ISO-8859-1)
>> 3) Encoding of page used to display data (Irrelevant to XML-RPC
>> transfers, but 1a,1b,1c apply)
>> 4) PHP internal encoding
>> 5) XMLRPC library internal encoding
>> 6) XML declaration charset (optional, but highly recommended by spec)
>> 7) text/xml MIME type charset declaration (optional, mine: text/
>> 8) application/xml MIME type charset declaration (optional)
>> ...and since all of them could be set to different encodings, keeping
>> it all straight is a dizzying adventure. Add to that the complexity
>> of handling things like users copying text from a Word document
>> created in Windows-1252 and pasting into a form on a UTF-8 page,
>> and...ugh! Sometimes I just want to kill myself.
>> While I suppose that attempting to convert all data into us-ascii
>> through entity encoding gives us the "least common denominator"
>> solution -- make everything 7-bit! -- it obviously isn't working
>> perfectly. So perhaps any solution that simply makes it work,
>> regardless of whether or not it changes the use of
>> $xmlrpc_internalencoding, would be good. I did wonder about the
>> utf8_encode() function, and why you didn't simply use that instead of
>> $character = ("&#".strval($code).";"); Won't that do all the right
>> work for you?
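As a side note on that question: utf8_encode() converts ISO-8859-1 bytes into UTF-8 bytes, so its output is still 8-bit data that needs a correct charset declaration, whereas the "&#...;" approach yields 7-bit ASCII. The sketch below shows the difference with two hand-rolled, hypothetical helpers (neither is part of the library, and XML escaping of & and < is left out for brevity); it assumes the input really is ISO-8859-1.

```php
<?php
// Hypothetical helpers for comparison only; input is assumed to be ISO-8859-1.
function sketch_latin1_to_entities($latin1)
{
    $out = '';
    for ($i = 0; $i < strlen($latin1); $i++) {
        $code = ord($latin1[$i]);
        // In ISO-8859-1 the byte value is the Unicode code point.
        $out .= ($code < 0x80) ? $latin1[$i] : '&#' . $code . ';';
    }
    return $out;
}

function sketch_latin1_to_utf8($latin1)
{
    $out = '';
    for ($i = 0; $i < strlen($latin1); $i++) {
        $code = ord($latin1[$i]);
        // Code points 0x80-0xFF become a two-byte UTF-8 sequence.
        $out .= ($code < 0x80)
            ? $latin1[$i]
            : chr(0xC0 | ($code >> 6)) . chr(0x80 | ($code & 0x3F));
    }
    return $out;
}

$latin1 = "caf\xE9";
echo sketch_latin1_to_entities($latin1), "\n";      // caf&#233;   (7-bit ASCII)
echo bin2hex(sketch_latin1_to_utf8($latin1)), "\n"; // 636166c3a9  (UTF-8 bytes)
```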
>> In any case, I think you should try to make the XMLRPC library follow
>> as closely as possible the relevant spec/RFC "recommended" behavior,
>> and let that be your guide.
>>> Adding some extra settings to client/server objects is fine, but
>>> the casual user might not be used to using those, and backward
>>> compatibility is a primary concern to me.
>>> Translated into code that would probably mean adding some hacky stuff
>>> of the sort "object default charset preference is undefined, and
>>> while still undefined use global variable, otherwise use object
>>> preference" (doable but ugly).
>>> The tough part is letting the client object communicate the
>>> desired charset encoding to the xmlrpcval object, since the
>>> responsibility of creating serialized content is left to the
>>> xmlrpcval object itself (and I'm surely not changing that
>>> fundamental assumption).
>> If you converted $xmlrpc_internalencoding to a property of xmlrpcmsg
>> instead of a global variable, then you could simply set it
>> to "iso-8859-1" in the constructor method for the class object. So
>> you maintain your default, but allow users to reset it through
>>> ps: the real (only ?) advantage of using variables instead of
>>> constants for things such as internal_encoding is that you can
>>> redefine them not inside the xmlrpc lib but just after its
>>> inclusion, eg.
>>> <?php include('xmlrpc.inc'); $xmlrpc_internal_encoding = 'UTF-16';
>>> echo 'etc...'; ?>
>>> this way you do not have to change anything when updating...
>> Ah, yes, this is true, and I hadn't really thought of such a simple
>> thing (but the same method holds true for using an object property).
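The two ideas discussed here (an object property for the encoding, with the global variable as the fallback while the property is still undefined) might be combined along the lines of this sketch. SketchMsg and effectiveEncoding() are hypothetical names chosen for the example; this is not how xmlrpcmsg is actually written.

```php
<?php
// Hypothetical class; it only mimics the idea, not the real xmlrpcmsg.
$GLOBALS['xmlrpc_internalencoding'] = 'ISO-8859-1'; // library-wide default

class SketchMsg
{
    // null means "no per-object preference set yet".
    public $internal_encoding = null;

    public function effectiveEncoding()
    {
        if ($this->internal_encoding !== null) {
            return $this->internal_encoding;
        }
        // Still undefined on the object: use the global, as existing code expects.
        return $GLOBALS['xmlrpc_internalencoding'];
    }
}

$msg = new SketchMsg();
echo $msg->effectiveEncoding(), "\n"; // ISO-8859-1 (falls back to the global)
$msg->internal_encoding = 'UTF-8';
echo $msg->effectiveEncoding(), "\n"; // UTF-8 (per-object override)
```

Existing scripts that only set the global after including the library keep working, which is the backward-compatibility concern raised above.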
>> How the PEAR people are handling this:
>> ["According to RFC 3023 section 3.1, the encoding specified in the
>> <?xml encoding=... ?> tag should be ignored for XML received over HTTP
>> in favor of the encoding specified in the Content-Type header (e.g.
>> "Content-Type: text/xml; charset=iso-8859-1")."]
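The precedence described in that quote can be sketched as a small lookup function. The name sketch_detect_charset() is hypothetical. Note one deliberate assumption: when the header carries no charset, this sketch falls back to the XML declaration before using the "us-ascii" default, which is more lenient than a strict reading of the RFC text quoted earlier for text/xml.

```php
<?php
// Hypothetical helper; returns a lowercase charset name.
function sketch_detect_charset($contentType, $xmlBody, $default = 'us-ascii')
{
    // 1. A charset parameter in the Content-Type header wins.
    if (preg_match('/;\s*charset\s*=\s*"?([A-Za-z0-9._:-]+)"?/i', $contentType, $m)) {
        return strtolower($m[1]);
    }
    // 2. Assumption of this sketch: otherwise look at the XML declaration.
    if (preg_match('/^<\?xml[^>]*\bencoding\s*=\s*["\']([A-Za-z0-9._-]+)["\']/', $xmlBody, $m)) {
        return strtolower($m[1]);
    }
    // 3. Nothing declared anywhere: use the default.
    return $default;
}

$body = '<?xml version="1.0" encoding="UTF-8"?><methodCall/>';
echo sketch_detect_charset('text/xml; charset=ISO-8859-1', $body), "\n"; // iso-8859-1
echo sketch_detect_charset('text/xml', $body), "\n";                     // utf-8
echo sketch_detect_charset('text/xml', '<methodCall/>'), "\n";           // us-ascii
```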
>> I found another developer reflecting on these same questions, for a
>> blogging app that uses XML-RPC:
>> Other messages about the default encoding of unspecified xml
>> 004361.html (in reverse chronological order)
>> a.h.s. boy
>> spud(at)nothingness.org "as yes is to if,love is to yes"