[phpxmlrpc] xmlrpc_encode_entitites causing parse error

Gaetano Giunta giunta.gaetano at sea-aeroportimilano.it
Tue Nov 15 16:31:06 GMT 2005


Very thoughtful response.

UTF-8 everywhere is fine and dandy but for 2 aspects:

- in fact XML-over-HTTP without a charset declaration SHOULD be assumed to be ISO-8859-1 (there is an RFC somewhere about that, which I cannot recall now).
The xmlrpc lib got it wrong the first time around, but I never dared to change the global var to a more 'correct' default, as the only effect I can imagine would have been breaking a lot of people's scripts.
This basically contradicts the argument that UTF-8 is universal: xmlrpc clients written in other languages might (correctly) assume that the received xml charset is ISO-8859-1 when unspecified, and dutifully choke on UTF-8 characters.

- unless mbstring is enabled, all PHP processing is carried out in ISO-8859-1 (of course, this does not apply to data fetched from your DB directly in UTF-8 encoding)

Having said that, there is no guarantee that the strings the user gets out of his db are in fact UTF-8, and sending some weird Japanese charset with a UTF-8 declaration is most likely wrong.

Adding some extra settings to client/server objects is fine, but the casual user might not be used to using those, and backward compatibility is a primary concern to me.
Translated into code, that would probably mean adding some hacky stuff of the sort "object default charset preference is undefined, and while still undefined use global variable, otherwise use object preference" (doable but ugly).
The tough part is letting the client object communicate the desired charset encoding to the xmlrpcval object, since the responsibility of creating serialized content is left to the xmlrpcval object itself (and I'm surely not changing that fundamental assumption).
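
A minimal sketch of the "undefined means use the global" fallback described above. The class name sketch_client and the methods set_internalencoding/get_internalencoding are hypothetical, made up for illustration only; they are not part of the lib. Only the global name xmlrpc_internalencoding comes from this thread.

```php
<?php
// Library-wide default, kept for backward compatibility.
$GLOBALS['xmlrpc_internalencoding'] = 'ISO-8859-1';

// Hypothetical class for illustration; not the real xmlrpc_client.
class sketch_client
{
    // null means "no preference set": defer to the global variable.
    public $internalencoding = null;

    public function set_internalencoding($charset)
    {
        $this->internalencoding = $charset;
    }

    // Object preference wins once defined; otherwise the global is read
    // at call time, so later changes to it are still picked up.
    public function get_internalencoding()
    {
        if ($this->internalencoding !== null) {
            return $this->internalencoding;
        }
        return $GLOBALS['xmlrpc_internalencoding'];
    }
}

$old_style = new sketch_client();   // existing scripts: nothing to change
$new_style = new sketch_client();
$new_style->set_internalencoding('UTF-8');

echo $old_style->get_internalencoding(), "\n"; // ISO-8859-1
echo $new_style->get_internalencoding(), "\n"; // UTF-8
```

Scripts that only ever touch the global keep working, which is the backward-compatibility point. The sketch does not address how the chosen charset would reach the xmlrpcval object at serialization time.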

I think I need a couple of days to sort out a good solution...

Bye
Gaetano

ps: the real (only?) advantage of using variables instead of constants for things such as internal_encoding is that you can redefine them not inside the xmlrpc lib but just after its inclusion, e.g.
<?php include('xmlrpc.inc'); $xmlrpc_internalencoding = 'UTF-16'; echo 'etc...'; ?>
This way you do not have to change anything when updating the lib...

> -----Original Message-----
> From: a.h.s. boy (lists) [mailto:spudlists at nothingness.org]
> Sent: Tuesday, November 15, 2005 4:34 PM
> To: Gaetano Giunta
> Cc: phpxmlrpc at lists.usefulinc.com
> Subject: Re: [phpxmlrpc] xmlrpc_encode_entitites causing parse error
> 
> 
> On Nov 15, 2005, at 4:11 AM, Gaetano Giunta wrote:
> 
> > Brief analysis:
> >
> > - the lib tries to encode all chars outside of the ASCII range as  
> > 'XML character entity' when serializing
> 
> I understand the theory, but one of the benefits to using UTF-8 in  
> the first place is its ability to properly render all sorts of  
> languages and character sets. Debugging becomes brutal when you're  
> staring at a huge string of HTML entities.
> 
> > - this has the main benefit that such an xml is valid regardless of  
> > the charset assumed by the parser, i.e. we do not need to add a  
> > 'charset' parameter to either the HTTP Content-type header or the  
> > XML prologue
> 
> Well...apparently it isn't valid XML despite the lack of charset...or  
> we wouldn't be having this discussion! ;-)
> 
> > - it is also the best solution I could come up with to solve the  
> > long-standing problems with charset encodings (I also tried the  
> > other way round, e.g. explicitly stating the charset used for xml,  
> > in a private fork of the lib I use for personal projects, but I  
> > would rather stick with the current approach, as it solves the  
> > problem in a more elegant way)
> 
> Believe me, I totally understand the issue of long-standing charset  
> encoding problems! I've been developing a CMS that needs to handle  
> multiple languages, alphabets, directionality, and XML-RPC/RSS feeds  
> all on the same page! Not easy, especially if your own linguistic  
> range is limited to English and Romance languages!
> 
> But I'm also a fan of proper declarations...and I'd rather have an  
> XML feed explicitly declare its charset encoding (and work) than try  
> to be "universal" and fail. :-)
> 
> I'll admit to not being fully familiar with all the XMLRPC library  
> code -- only enough to debug a bit -- but it appears that  
> $xmlrpc_internalencoding is declared as a global variable, though it  
> is only used in object methods. Could it be changed to be a property  
> of the xmlrpcmsg and xmlrpc_server classes? That way it could be set  
> through scripting with
> 
> $xmlrpcmsg->set_internalencoding($foo);
> 
> or something similar? That would be more flexible, and since you  
> _always_ know what the encoding is, you can send it in the XML  
> prologue, which is what that parameter is designed for anyway.
> 
> > - basically, I see two options to extend the lib to make up for  
> > your problem:
> >   + extend the xmlrpc_encode_entitites function to take into  
> > account the xmlrpc_internalencoding global var, and use 2 different  
> > parsing algorithms (better solution but slower)
> 
> Well...UTF-8 should only require converting "&", "<", and '"'  
> explicitly, and the rest is assumed to be valid. So the only fork  
> you'd need in the code is to convert additional entities for  
> non-UTF-8 encodings. Shouldn't slow anything down...in fact, it would  
> make UTF-8 faster, since it would skip additional processing.
> 
> In fact, I may be mistaken, but it seems like older versions of the  
> library didn't even do the entity translation...at least, in the  
> course of my own development, I know I included some entity  
> conversion routines to process the data _before_ I sent it to the  
> XMLRPC library (but it may have been redundant on my part). Though I  
> admit I do like the idea that I can pass _anything_ to the XMLRPC  
> library and have it properly encoded for me!
> 
> > Would you be willing to test the patches?
> 
> Absolutely...but I do think you should give some serious thought to  
> making the internal encoding variable more scriptable so no one ever  
> needs to hard-code changes in the script file. I hate having to  
> remember to change the variable value whenever I upgrade the library...
> 
> Cheers,
> spud.
> 
> 
> -------------------------------------------------------------------
> a.h.s. boy
> spud(at)nothingness.org            "as yes is to if,love is to yes"
> http://www.nothingness.org/
> -------------------------------------------------------------------
> 
> 
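
To make the "two algorithms" idea from the quoted discussion concrete, here is a rough sketch, not the lib's actual xmlrpc_encode_entitites. The function name sketch_encode_entities and its $charset parameter are hypothetical. It assumes only two cases: UTF-8 input, where just the markup characters are escaped, and ISO-8859-1 input, where every byte above 127 is also turned into a numeric character reference (valid because ISO-8859-1 byte values equal Unicode code points).

```php
<?php
// Hypothetical helper for illustration; not part of the xmlrpc lib.
function sketch_encode_entities($data, $charset)
{
    // Markup characters must be escaped whatever the charset is.
    // strtr() works on bytes and never re-scans its own replacements.
    $escaped = strtr($data, array(
        '&' => '&amp;',
        '<' => '&lt;',
        '>' => '&gt;',
        '"' => '&quot;',
        "'" => '&apos;',
    ));

    // UTF-8: multi-byte sequences pass through untouched, provided the
    // charset is then declared in the XML prologue or HTTP header.
    if (strtoupper($charset) === 'UTF-8') {
        return $escaped;
    }

    // Assumed ISO-8859-1: one byte per character, byte value == code point.
    $out = '';
    $len = strlen($escaped);
    for ($i = 0; $i < $len; $i++) {
        $code = ord($escaped[$i]);
        $out .= ($code < 128) ? $escaped[$i] : '&#' . $code . ';';
    }
    return $out;
}

// "cafe" with e-acute, once as ISO-8859-1 bytes and once as UTF-8 bytes.
echo sketch_encode_entities("caf\xE9 & <b>", 'ISO-8859-1'), "\n";
echo sketch_encode_entities("caf\xC3\xA9 & <b>", 'UTF-8'), "\n";
```

The first call yields pure ASCII that any parser can read without a charset declaration. The second leaves the two UTF-8 bytes as they are, so it is only safe when the receiver is told the payload is UTF-8.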

