[phpxmlrpc] xmlrpc_encode_entitites causing parse error

a.h.s. boy (lists) spudlists at nothingness.org
Thu Nov 17 06:19:44 GMT 2005


I grabbed a copy from CVS, but I'm in the middle of a few days of  
hardcore iCalendar coding, so I'm focusing on that. I'll run some  
tests and offer comments as soon as I have the chance. Thanks for the  
quick work!

Cheers,
spud.

On Nov 16, 2005, at 11:33 AM, Gaetano Giunta wrote:

> OK, code checked into CVS. Feel free to download and test it (I
> added a new test case for UTF-8 in the testsuite, but the more testing
> the better).
>
> I adopted the 'convert all to ASCII' way-of-life, and modified the
> function xmlrpc_encode_entities() to respect the value of
> $GLOBALS['xmlrpc_internalencoding'].
>
> As stated in my last post, more flexible usage patterns might make  
> it into future releases.
>
> Right now escaping iso-8859-1 might be faster than it was  
> previously, since I use str_replace instead of the hand-made  
> algorithm, but escaping UTF8 will be dog slow.
> The lib is not built for speed anyway; if you're aiming for that,
> the php xmlrpc extension will surely serve you better.
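
A minimal sketch of the 'convert all to ASCII' idea described above, assuming only two source encodings matter. The function name sketch_encode_entities() and its second parameter are hypothetical, not the library's actual code: the ISO-8859-1 branch uses a str_replace lookup table, while the UTF-8 branch decodes byte by byte (which is where the slowness would come from) and does not validate continuation bytes.

```php
<?php
// Hypothetical helper (not the library's real implementation): turn a string
// into pure-ASCII XML text, using numeric character references for anything
// outside 7-bit ASCII.
function sketch_encode_entities(string $data, string $srcEncoding = 'ISO-8859-1'): string
{
    // Escape XML-reserved characters; '&' must be handled first.
    $data = str_replace(
        array('&', '"', "'", '<', '>'),
        array('&amp;', '&quot;', '&apos;', '&lt;', '&gt;'),
        $data
    );

    if (strtoupper($srcEncoding) === 'ISO-8859-1') {
        // In ISO-8859-1 each byte 0x80-0xFF is the code point of the same
        // value, so one table-driven str_replace call is enough.
        static $from = array(), $to = array();
        if (!$from) {
            for ($c = 128; $c <= 255; $c++) {
                $from[] = chr($c);
                $to[]   = "&#$c;";
            }
        }
        return str_replace($from, $to, $data);
    }

    // Assumed UTF-8: decode multi-byte sequences by hand, one byte at a time.
    $out = '';
    $len = strlen($data);
    for ($i = 0; $i < $len; $i++) {
        $b = ord($data[$i]);
        if ($b < 0x80) {                       // plain ASCII, copy through
            $out .= $data[$i];
            continue;
        }
        if (($b & 0xE0) === 0xC0) {            // 2-byte sequence
            $n = 1; $cp = $b & 0x1F;
        } elseif (($b & 0xF0) === 0xE0) {      // 3-byte sequence
            $n = 2; $cp = $b & 0x0F;
        } elseif (($b & 0xF8) === 0xF0) {      // 4-byte sequence
            $n = 3; $cp = $b & 0x07;
        } else {                               // stray byte: not a valid lead
            $out .= '?';
            continue;
        }
        if ($i + $n >= $len) {                 // sequence cut off at the end
            $out .= '?';
            break;
        }
        for ($k = 0; $k < $n; $k++) {
            $i++;
            $cp = ($cp << 6) | (ord($data[$i]) & 0x3F);
        }
        $out .= "&#$cp;";
    }
    return $out;
}
```

With this sketch, the same "é" comes out as &#233; whether it arrives as the single Latin-1 byte 0xE9 or as the UTF-8 pair 0xC3 0xA9, provided the caller names the right source encoding.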
>
> The main problems I see with that right now are:
>
> - turning xmlrpc_encode_entities() into a general charset
> transcoder might make it slower for the default case operation,
> unless the user has mbstring ON
>
> - how server and msg objs will communicate to xmlrpcval objs the  
> desired charset for serialization (only solution I can think of:  
> add an extra param in calls to serialize())
>
> - xmlrpc_encode_entities() is used when serializing server-added  
> debug info. Since that info might come at the same time from user  
> messages, client request (at debug lvl 3) and php error messages,  
> there is a serious risk it will be a charset pot-pourri, ie there  
> is no sure way that it will conform to ANY charset.
> I wonder if using a CDATA section instead of a comment to wrap  
> debug info might help in solving this problem.
> A second solution is to just base64-encode the debug info, and
> let the client sort it out.
> Of course that would break any existing client that makes use of
> that undocumented info...
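
A rough sketch of the base64 option, assuming the debug info keeps travelling inside an XML comment. The helper names and the "DEBUG INFO:" marker are made up for illustration and are not what the library emits. Because base64 output is plain ASCII and never contains "--", the wrapped comment stays well-formed no matter what mix of charsets went in.

```php
<?php
// Hypothetical helpers: names and the "DEBUG INFO:" marker are illustrative.

// Server side: wrap arbitrary (possibly mixed-charset) debug text.
function sketch_wrap_debug(string $debug): string
{
    // base64 uses only A-Z, a-z, 0-9, '+', '/' and '=', so the result is
    // 7-bit clean and cannot contain "--", which is illegal in a comment.
    return "<!-- DEBUG INFO:\n" . base64_encode($debug) . "\n-->";
}

// Client side: pull the debug text back out, or null if none is present.
function sketch_unwrap_debug(string $xml): ?string
{
    if (!preg_match('/<!-- DEBUG INFO:\s*([A-Za-z0-9+\/=]+)\s*-->/', $xml, $m)) {
        return null;
    }
    $decoded = base64_decode($m[1], true);   // strict mode rejects bad input
    return $decoded === false ? null : $decoded;
}
```

The client gets back exactly the bytes the server collected, and deciding how to display them is left to the client.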
>
> Bye
> Gaetano
>
>> -----Original Message-----
>> From: a.h.s. boy (lists) [mailto:spudlists at nothingness.org]
>> Sent: Tuesday, November 15, 2005 6:57 PM
>> To: Gaetano Giunta
>> Cc: phpxmlrpc at lists.usefulinc.com
>> Subject: Re: [phpxmlrpc] xmlrpc_encode_entitites causing parse error
>>
>>
>> On Nov 15, 2005, at 11:31 AM, Gaetano Giunta wrote:
>>
>>> Very toughtful response.
>>
>> Man, I love cross-linguistic typos...makes great new English words:
>> "toughtful" = "tough thoughtfulness". Brilliant.
>>
>>> UTF-8 everywhere is fine and dandy but for 2 aspects:
>>>
>>> - in fact XML-over-http without a charset declaration SHOULD be
>>> assumed to be ISO-8859-1 (there is a RFC somewhere about that,
>>> which I cannot recall now).
>>
>> Hmmm. The XML 1.0 spec (http://www.w3.org/TR/2000/REC-xml-20001006)
>> reads:
>>
>> Because each XML entity not accompanied by external encoding
>> information and not in UTF-8 or UTF-16 encoding MUST begin with an
>> XML encoding declaration, in which the first characters must be
>> '<?xml', any conforming processor can detect, after two to four octets
>> of input, which of the following cases apply.
>>
>> RFC 2376, however, offers suggestions for XML MIME-types sent over
>> HTTP, but it reads (pardon the length):
>>
>>    Although listed as an optional parameter, the use of the charset
>>    parameter is STRONGLY RECOMMENDED, since this information can be
>>    used by XML processors to determine authoritatively the character
>>    encoding of the XML entity. The charset parameter can also be used
>>    to provide protocol-specific operations, such as charset-based
>>    content negotiation in HTTP.  "UTF-8" [RFC-2279] is the
>>    recommended value, representing the UTF-8 charset. UTF-8 is
>>    supported by all conforming XML processors [REC-XML].
>>
>>    If the XML entity is transmitted via HTTP, which uses a MIME-like
>>    mechanism that is exempt from the restrictions on the text
>>    top-level type (see section 19.4.1 of HTTP 1.1 [RFC-2068]),
>>    "UTF-16" (Appendix C.3 of [UNICODE] and Amendment 1 of
>>    [ISO-10646]) is also recommended.  UTF-16 is supported by all
>>    conforming XML processors [REC-XML].  Since the handling of CR, LF
>>    and NUL for text types in most MIME applications would cause
>>    undesired transformations of individual octets in UTF-16
>>    multi-octet characters, gateways from HTTP to these MIME
>>    applications MUST transform the XML entity from a text/xml;
>>    charset="utf-16" to application/xml; charset="utf-16".
>>
>>    Conformant with [RFC-2046], if a text/xml entity is received with
>>    the charset parameter omitted, MIME processors and XML processors
>>    MUST use the default charset value of "us-ascii".  In cases where
>>    the XML entity is transmitted via HTTP, the default charset value
>>    is still "us-ascii".
>>
>> ...which implies that us-ascii, not iso-8859-1, is the default (but
>> not really a problem if you're encoding everything outside of
>> ASCII).
>> But I know that my RDFParser class, for example, defaults to "utf-8"
>> and overrides that only if the encoding is specified as something
>> else in the xml declaration. I assume I made that decision for good
>> reasons, though I don't remember them now!
>>
>> Still, the factors affecting encoding and transmission are
>> unbelievably complex. In my software, for example, there are:
>>
>> 1) Page encoding used when users submit data via a form (mine: UTF-8)
>>     a) Default charset header sent by Apache (mine:  UTF-8)
>>     b) Default charset set in META tags (mine: UTF-8)
>>     c) Charset setting of client browser (no control!)
>> 2) Encoding of database (mine: MySQL 3.x, so limited to ISO-8859-1)
>> 3) Encoding of page used to display data (Irrelevant to XML-RPC
>> transfers, but 1a,1b,1c apply)
>> 4) PHP internal encoding
>> 5) XMLRPC library internal encoding
>> 6) XML declaration charset (optional, but highly recommended by spec)
>> 7) text/xml MIME type charset declaration (optional, mine:
>>    text/xml;charset=utf-8)
>> 8) application/xml MIME type charset declaration (optional)
>>
>> ...and since all of them could be set to different encodings,
>> getting it all straight is a dizzying adventure. Add to that the
>> complexity of handling things like users copying text from a Word
>> document created in Windows-1252 and pasting into a form on a UTF-8
>> page, and...ugh! Sometimes I just want to kill myself.
>>
>> While I suppose that attempting to convert all data into us-ascii
>> through entity encoding gives us the "least common denominator"
>> solution -- make everything 7-bit! -- it obviously isn't working
>> perfectly. So perhaps any solution that simply makes it work,
>> regardless of whether or not it changes the use of
>> $xmlrpc_internalencoding, would be good. I did wonder about the
>> utf8_encode() function, and why you didn't simply use that instead
>> of $character = ("&#".strval($code).";"); Won't that do all the
>> right work for you?
>>
>> In any case, I think you should try to make the XMLRPC library
>> follow as closely as possible the relevant spec/RFC "recommended"
>> behavior, and let that be your guide.
>>
>>> Adding some extra settings to client/server objects is fine, but
>>> the casual user might not be used to using those, and backward
>>> compatibility is a primary concern to me.
>>> Translated into code, that would probably mean adding some hacky
>>> stuff of the sort "object default charset preference is undefined,
>>> and while still undefined use global variable, otherwise use object
>>> preference" (doable but ugly).
>>> The tough part is letting the client object communicate the
>>> desired charset encoding to the xmlrpcval object, since the
>>> responsibility of creating serialized content is left to the
>>> xmlrpcval object itself (and I'm surely not changing that
>>> fundamental assumption).
>>
>> If you converted $xmlrpc_internalencoding to a property of xmlrpcmsg
>> instead of a global variable, then you could simply set it to
>> default to "iso-8859-1" in the constructor method for the class
>> object. So you maintain your default, but allow users to reset it
>> through scripting.
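
Something along these lines, as a sketch only: SketchMsg is a hypothetical stand-in for xmlrpcmsg, and the property name is invented. It combines the constructor default with the "use the global while the object preference is undefined" fallback mentioned in the quoted text, so scripts that set the global after including the library would keep working.

```php
<?php
// Hypothetical stand-in for xmlrpcmsg; the real class has no such property.
class SketchMsg
{
    /** @var string Charset the caller's strings are assumed to be in. */
    public $internalEncoding;

    public function __construct(?string $internalEncoding = null)
    {
        if ($internalEncoding !== null) {
            // Explicit per-object preference wins.
            $this->internalEncoding = $internalEncoding;
        } elseif (isset($GLOBALS['xmlrpc_internalencoding'])) {
            // Backward compatibility: honour the existing global variable.
            $this->internalEncoding = $GLOBALS['xmlrpc_internalencoding'];
        } else {
            // Historical default.
            $this->internalEncoding = 'ISO-8859-1';
        }
    }
}
```

Since the property is public, a script can also reset it after construction, e.g. $msg->internalEncoding = 'UTF-8';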
>>
>>> ps: the real (only ?) advantage of using variables instead of
>>> constants for things such as the internal encoding is that you can
>>> redefine them not inside the xmlrpc lib but just after its
>>> inclusion, eg.
>>> <?php include('xmlrpc.inc'); $xmlrpc_internalencoding = 'UTF-16';
>>> echo 'etc...'; ?>
>>> this way you do not have to change anything when updating...
>>
>> Ah, yes, this is true, and I hadn't really thought of such a simple
>> thing (but the same method holds true for using an object property).
>>
>> How the PEAR people are handling this:
>> http://pear.php.net/bugs/bug.php?id=52
>> ["According to RFC 3023 section 3.1, the encoding specified in the
>> <?xml encoding=... ?> tag should be ignored for XML received over HTTP
>> in favor of the encoding specified in the Content-Type header (e.g.
>> "Content-Type: text/xml; charset=iso-8859-1")."]
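
Putting the precedence rules quoted in this message together, a receiver might decide on a charset roughly as follows. This is a sketch under assumptions: sketch_guess_charset() is a hypothetical helper, the fall-through to the XML declaration for non-text media types is my reading rather than something the quotes above spell out, and BOM / UTF-16 detection is left out entirely.

```php
<?php
// Hypothetical helper: decide which charset to assume for an XML payload
// received over HTTP. $contentType is the raw Content-Type header value,
// or null if the header was missing.
function sketch_guess_charset(?string $contentType, string $body): string
{
    // 1. An explicit charset parameter in the Content-Type header wins,
    //    even over the encoding named in the XML declaration.
    if ($contentType !== null
        && preg_match('/;\s*charset\s*=\s*"?([A-Za-z0-9._:-]+)"?/i', $contentType, $m)) {
        return strtoupper($m[1]);
    }

    // 2. text/* with the charset omitted defaults to us-ascii, per the
    //    RFC 2376 passage quoted earlier in this message.
    if ($contentType !== null && preg_match('#^\s*text/#i', $contentType)) {
        return 'US-ASCII';
    }

    // 3. Otherwise (assumption) fall back to the XML declaration, if any.
    if (preg_match('/^<\?xml\s[^>]*?encoding\s*=\s*["\']([A-Za-z0-9._-]+)["\']/', $body, $m)) {
        return strtoupper($m[1]);
    }

    // 4. No external information and no declaration: treat as UTF-8
    //    (UTF-16 detection by byte order mark is not attempted here).
    return 'UTF-8';
}
```

Note how, under these rules, a text/xml response with no charset parameter is treated as us-ascii even when its XML declaration says utf-8, which is one more argument for always sending the charset explicitly.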
>>
>> I found another developer reflecting on these same questions, for a
>> blogging app that uses XML-RPC:
>> http://ecto.kung-foo.tv/archives/000975.php
>>
>> Other messages about the default encoding of unspecified xml
>> documents:
>> http://groups.yahoo.com/group/xml-rpc/message/45
>> http://mail.zope.org/pipermail/zope-collector-monitor/2004-October/004361.html
>> (in reverse chronological order)
>>
>> Cheers,
>> spud.
>>
>> -------------------------------------------------------------------
>> a.h.s. boy
>> spud(at)nothingness.org            "as yes is to if,love is to yes"
>> http://www.nothingness.org/
>> -------------------------------------------------------------------
>>

-------------------------------------------------------------------
a.h.s. boy
spud(at)nothingness.org            "as yes is to if,love is to yes"
http://www.nothingness.org/
-------------------------------------------------------------------


