[phpxmlrpc] xmlrpc_encode_entitites causing parse error

Wed Nov 16 16:33:19 GMT 2005

OK, the code is checked into CVS. Feel free to download and test it (I added a new test case for UTF-8 to the testsuite, but the more testing the better).

I adopted the 'convert all to ASCII' way-of-life, and modified the function xmlrpc_encode_entities() to respect the value of $GLOBALS['xmlrpc_internalencoding'].
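
To make the 'convert all to ASCII' idea concrete, here is a minimal sketch. It is not the code that went into CVS: the function name sketch_encode_entities() is made up for illustration, and it assumes the global holds either 'ISO-8859-1' or 'UTF-8'. It also does not validate UTF-8 continuation bytes.

```php
<?php
// Hypothetical sketch, NOT the function shipped with the lib.
// Assumption: the global is either 'ISO-8859-1' (default) or 'UTF-8'.
$GLOBALS['xmlrpc_internalencoding'] = 'ISO-8859-1';

function sketch_encode_entities($data)
{
    $xmlSpecial = array('&' => '&amp;', '<' => '&lt;', '>' => '&gt;',
                        '"' => '&quot;', "'" => '&apos;');
    $isUtf8 = (strtoupper($GLOBALS['xmlrpc_internalencoding']) === 'UTF-8');
    $out = '';
    $len = strlen($data);
    for ($i = 0; $i < $len; $i++) {
        $ch = $data[$i];
        $byte = ord($ch);
        if ($byte < 0x80) {
            // Plain ASCII: only the XML special characters need escaping.
            $out .= isset($xmlSpecial[$ch]) ? $xmlSpecial[$ch] : $ch;
            continue;
        }
        if (!$isUtf8) {
            // ISO-8859-1: the byte value is the Unicode code point.
            $out .= '&#' . $byte . ';';
            continue;
        }
        // UTF-8: the lead byte says how many continuation bytes follow.
        if (($byte & 0xE0) === 0xC0) {
            $extra = 1; $code = $byte & 0x1F;
        } elseif (($byte & 0xF0) === 0xE0) {
            $extra = 2; $code = $byte & 0x0F;
        } elseif (($byte & 0xF8) === 0xF0) {
            $extra = 3; $code = $byte & 0x07;
        } else {
            $out .= '?'; // stray continuation byte or invalid lead byte
            continue;
        }
        if ($i + $extra >= $len) {
            $out .= '?'; // sequence truncated at end of string
            break;
        }
        for ($j = 1; $j <= $extra; $j++) {
            // Each continuation byte contributes its low 6 bits.
            $code = ($code << 6) | (ord($data[$i + $j]) & 0x3F);
        }
        $i += $extra;
        $out .= '&#' . $code . ';';
    }
    return $out;
}
```

Either way the output is pure 7-bit ASCII, so it stays valid whatever charset the receiving end assumes.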

As stated in my last post, more flexible usage patterns might make it into future releases.

Right now escaping ISO-8859-1 might be faster than it was previously, since I use str_replace instead of the hand-made algorithm, but escaping UTF-8 will be dog slow.
The lib is not built for speed anyway; if you're aiming for that, the php xmlrpc extension will surely serve you better.
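
For the curious, the str_replace idea for ISO-8859-1 can be sketched roughly as below. The names (sketch_escape_latin1 and the two table globals) are invented for this example and the real code may differ. The point is that a single-byte charset allows a fixed lookup table, whereas UTF-8 sequences have to be decoded one by one.

```php
<?php
// Hypothetical sketch of a lookup-table approach; all names are illustrative.
// '&' must come first: str_replace applies the pairs in order, and every
// later replacement introduces a new '&' that must not be escaped again.
$GLOBALS['sketch_latin1_in']  = array('&', '"', "'", '<', '>');
$GLOBALS['sketch_latin1_out'] = array('&amp;', '&quot;', '&apos;', '&lt;', '&gt;');
for ($c = 128; $c < 256; $c++) {
    // In ISO-8859-1 each high byte maps to the code point of the same value.
    $GLOBALS['sketch_latin1_in'][]  = chr($c);
    $GLOBALS['sketch_latin1_out'][] = '&#' . $c . ';';
}

function sketch_escape_latin1($data)
{
    return str_replace(
        $GLOBALS['sketch_latin1_in'],
        $GLOBALS['sketch_latin1_out'],
        $data
    );
}
```

The tables are built once at include time, so each call is a single str_replace over the string.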

The main problems I see with those more flexible usage patterns right now are:

- turning xmlrpc_encode_entities() into a general charset transcoder might make it slower for the default case, unless the user has mbstring enabled

- how server and msg objects will communicate the desired serialization charset to xmlrpcval objects (the only solution I can think of: add an extra param to calls to serialize())

- xmlrpc_encode_entities() is used when serializing server-added debug info. Since that info might come at the same time from user messages, the client request (at debug level 3) and php error messages, there is a serious risk it will be a charset pot-pourri, i.e. there is no guarantee that it will conform to ANY charset.
I wonder if using a CDATA section instead of a comment to wrap debug info might help in solving this problem.
The second solution is to just base64-encode the debug info, and let the client sort it out.
Of course that would break any existing client that makes use of that undocumented info...
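
A rough sketch of the base64 option follows. The marker text and both function names are assumptions made up for this example, not the format the lib uses. Note that a CDATA section is still subject to the document's declared encoding, so on its own it probably would not cure the mixed-charset problem, while base64 output is plain ASCII by construction.

```php
<?php
// Hypothetical sketch: marker text and function names are assumptions,
// not the format used by the lib.
function sketch_wrap_debug_info($debugInfo)
{
    // base64 output uses only A-Z, a-z, 0-9, '+', '/' and '=', so it can
    // contain neither '--' nor any non-ASCII byte inside the XML comment.
    return "<!-- DEBUG INFO (BASE64):\n" . base64_encode($debugInfo) . "\n-->";
}

function sketch_unwrap_debug_info($xml)
{
    // Client side: find the comment and decode it back to the raw bytes.
    $pattern = '/<!-- DEBUG INFO \(BASE64\):\n([A-Za-z0-9+\/=]*)\n-->/';
    if (preg_match($pattern, $xml, $m)) {
        return base64_decode($m[1]);
    }
    return null;
}
```

The client gets back exactly the bytes the server collected and has to guess their charset itself, which is the 'let the client sort it out' part.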

Bye
Gaetano

> -----Original Message-----
> From: a.h.s. boy (lists) [mailto:spudlists at nothingness.org]
> Sent: Tuesday, November 15, 2005 6:57 PM
> To: Gaetano Giunta
> Cc: phpxmlrpc at lists.usefulinc.com
> Subject: Re: [phpxmlrpc] xmlrpc_encode_entitites causing parse error
> 
> 
> On Nov 15, 2005, at 11:31 AM, Gaetano Giunta wrote:
> 
> > Very toughtful response.
> 
> Man, I love cross-linguistic typos...makes great new English words:  
> "toughtful" = "tough thoughtfulness". Brilliant.
> 
> > UTF-8 everywhere is fine and dandy but for 2 aspects:
> >
> > - in fact XML-over-http without a charset declaration SHOULD be
> > assumed to be ISO-8859-1 (there is an RFC somewhere about that,
> > which I cannot recall now).
> 
> Hmmm. The XML 1.0 spec (http://www.w3.org/TR/2000/REC-xml-20001006)  
> reads:
> 
> Because each XML entity not accompanied by external encoding
> information and not in UTF-8 or UTF-16 encoding MUST begin with an
> XML encoding declaration, in which the first characters must be
> '<?xml', any conforming processor can detect, after two to four octets
> of input, which of the following cases apply.
> 
> RFC 2376, however, offers suggestions for XML MIME-types sent over  
> HTTP, but it reads (pardon the length):
> 
>        Although listed as an optional parameter, the use of the charset
>        parameter is STRONGLY RECOMMENDED, since this information can be
>        used by XML processors to determine authoritatively the character
>        encoding of the XML entity. The charset parameter can also be used
>        to provide protocol-specific operations, such as charset-based
>        content negotiation in HTTP.  "UTF-8" [RFC-2279] is the
>        recommended value, representing the UTF-8 charset. UTF-8 is
>        supported by all conforming XML processors [REC-XML].
>
>        If the XML entity is transmitted via HTTP, which uses a MIME-like
>        mechanism that is exempt from the restrictions on the text
>        top-level type (see section 19.4.1 of HTTP 1.1 [RFC-2068]), "UTF-16"
>        (Appendix C.3 of [UNICODE] and Amendment 1 of [ISO-10646]) is also
>        recommended.  UTF-16 is supported by all conforming XML processors
>        [REC-XML].  Since the handling of CR, LF and NUL for text types in
>        most MIME applications would cause undesired transformations of
>        individual octets in UTF-16 multi-octet characters, gateways from
>        HTTP to these MIME applications MUST transform the XML entity from
>        a text/xml; charset="utf-16" to application/xml; charset="utf-16".
>
>        Conformant with [RFC-2046], if a text/xml entity is received with
>        the charset parameter omitted, MIME processors and XML processors
>        MUST use the default charset value of "us-ascii".  In cases where
>        the XML entity is transmitted via HTTP, the default charset value
>        is still "us-ascii".
> 
> ...which implies that us-ascii, not iso-8859-1, is the default (but
> not really a problem if you're encoding everything outside of ASCII).
> But I know that my RDFParser class, for example, defaults to "utf-8"
> and overrides that only if the encoding is specified as something
> else in the xml declaration. I assume I made that decision for good
> reasons, though I don't remember them now!
> 
> Still, the factors affecting encoding and transmission are
> unbelievably complex. In my software, for example, there are:
> 
> 1) Page encoding used when users submit data via a form (mine: UTF-8)
>     a) Default charset header sent by Apache (mine:  UTF-8)
>     b) Default charset set in META tags (mine: UTF-8)
>     c) Charset setting of client browser (no control!)
> 2) Encoding of database (mine: MySQL 3.x, so limited to ISO-8859-1)
> 3) Encoding of page used to display data (Irrelevant to XML-RPC  
> transfers, but 1a,1b,1c apply)
> 4) PHP internal encoding
> 5) XMLRPC library internal encoding
> 6) XML declaration charset (optional, but highly recommended by spec)
> 7) text/xml MIME type charset declaration (optional, mine:
>    text/xml;charset=utf-8)
> 8) application/xml MIME type charset declaration (optional)
> 
> ...and since all of them could be set to different encodings, getting
> it all straight is a dizzying adventure. Add to that the complexity
> of handling things like users copying text from a Word document
> created in Windows-1252 and pasting into a form on a UTF-8 page,
> and...ugh! Sometimes I just want to kill myself.
> 
> While I suppose that attempting to convert all data into us-ascii
> through entity encoding gives us the "least common denominator"
> solution -- make everything 7-bit! -- it obviously isn't working
> perfectly. So perhaps any solution that simply makes it work,
> regardless of whether or not it changes the use of
> $xmlrpc_internalencoding, would be good. I did wonder about the
> utf8_encode() function, and why you didn't simply use that instead of
> $character = ("&#".strval($code).";"); Won't that do all the right
> work for you?
> 
> In any case, I think you should try to make the XMLRPC library follow
> as closely as possible the relevant spec/RFC "recommended" behavior,
> and let that be your guide.
> 
> > Adding some extra settings to client/server objects is fine, but
> > the casual user might not be used to using those, and backward
> > compatibility is a primary concern to me.
> > Translated into code that would probably mean adding some hacky stuff
> > of the sort "object default charset preference is undefined, and
> > while still undefined use global variable, otherwise use object
> > preference" (doable but ugly).
> > The tough part is letting the client object communicate the
> > desired charset encoding to the xmlrpcval object, since the
> > responsibility of creating serialized content is left to the
> > xmlrpcval object itself (and I'm surely not changing that
> > fundamental assumption).
> 
> If you converted $xmlrpc_internalencoding to a property of xmlrpcmsg
> instead of a global variable, then you could simply set it to default
> to "iso-8859-1" in the constructor method for the class object. So
> you maintain your default, but allow users to reset it through
> scripting.
> 
> > ps: the real (only ?) advantage of using variables instead of
> > constants for things such as internal_encoding is that you can
> > redefine them not inside the xmlrpc lib but just after its
> > inclusion, eg.
> > <?php include('xmlrpc.inc'); $xmlrpc_internalencoding = 'UTF-16';
> > echo 'etc...'; ?>
> > this way you do not have to change anything when updating...
> 
> Ah, yes, this is true, and I hadn't really thought of such a simple  
> thing (but the same method holds true for using an object property).
> 
> How the PEAR people are handling this:
> http://pear.php.net/bugs/bug.php?id=52
> ["According to RFC 3023 section 3.1, the encoding specified in the
> <?xml encoding=... ?> tag should be ignored for XML received over HTTP
> in favor of the encoding specified in the Content-Type header (e.g.
> "Content-Type: text/xml; charset=iso-8859-1")."]
> 
> I found another developer reflecting on these same questions, for a  
> blogging app that uses XML-RPC:
> http://ecto.kung-foo.tv/archives/000975.php
> 
> Other messages about the default encoding of unspecified xml documents
> (in reverse chronological order):
> http://groups.yahoo.com/group/xml-rpc/message/45
> http://mail.zope.org/pipermail/zope-collector-monitor/2004-October/004361.html
> 
> Cheers,
> spud.
> 
> -------------------------------------------------------------------
> a.h.s. boy
> spud(at)nothingness.org            "as yes is to if,love is to yes"
> http://www.nothingness.org/
> -------------------------------------------------------------------
> 
>