[phpxmlrpc] xmlrpc_encode_entitites causing parse error

Gaetano Giunta giunta.gaetano at sea-aeroportimilano.it
Wed Nov 16 09:01:11 GMT 2005


Darn, just when I thought I had reached charset-encoding guru state, I discover I was mostly wrong.
I really love to be a coder...

> ...
> On Nov 15, 2005, at 11:31 AM, Gaetano Giunta wrote:
> 
> > Very toughtful response.
> 
> Man, I love cross-linguistic typos...makes great new English words:  
> "toughtful" = "tough thoughtfulness". Brilliant.

I can do a lot better if you wish, mixing up Italian, French, English and PHP typos all in the same sentence ;)

> > UTF-8 everywhere is fine and dandy but for 2 aspects:
> >
> > - in fact XML-over-http without a charset declaration SHOULD be  
> > assumed to be ISO-8859-1 (there is a RFC somewhere about that,  
> > which I cannot recall now).
> 
> Hmmm. The XML 1.0 spec (http://www.w3.org/TR/2000/REC-xml-20001006)  
> reads:
> 
> ...
> 
> RFC 2376, however, offers suggestions for XML MIME-types sent over  
> HTTP, but it reads (pardon the length):
> 
> ...

OK, I'll admit I blew this one.
I cannot figure out which RFC I (mis)read that convinced me that Latin-1 was the way to go for text/xml over HTTP, but RFC 3023 is definitely THE reference on this subject. And it states that
- a charset parameter SHOULD be put in the HTTP headers for interop's sake
- when that is unavailable, text/xml MUST be treated as US-ASCII (regardless of the XML prologue...)
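
A minimal sketch of those two rules in PHP. The function name is made up for illustration (it is not part of the lib), and it only covers text/xml: RFC 3023 treats application/xml differently, which is not handled here.

```php
<?php
// Hypothetical helper, not part of the phpxmlrpc API.
// Returns the charset to assume for a text/xml body, given the raw
// value of the HTTP Content-Type header.
function charset_for_text_xml($contentType)
{
    // Rule 1: an explicit charset parameter in the header wins.
    if (preg_match('/;\s*charset\s*=\s*"?([A-Za-z0-9._:-]+)"?/i', $contentType, $m)) {
        return strtoupper($m[1]);
    }
    // Rule 2: no charset parameter means US-ASCII for text/xml,
    // whatever the XML prologue says.
    return 'US-ASCII';
}
```

Called with 'text/xml; charset=utf-8' it yields UTF-8; called with a bare 'text/xml' it falls back to US-ASCII.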

> ...
> But I know that my RDFParser class, for example, defaults to "utf-8"  
> and overrides that only if the encoding is specified as something  
> else in the xml delaration. I assume I made that decision for good  
> reasons, though I don't remember them now!

Most likely because of bad XML sources that send UTF-8 content without declaring it explicitly. Very annoying, but quite common, at least until a little while ago.
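
For illustration only, the "default to utf-8 unless the declaration says otherwise" behaviour could look roughly like this. This is a guess at the approach, not the actual RDFParser code, and the function name is hypothetical.

```php
<?php
// Hypothetical helper: read the encoding from the XML declaration,
// falling back to a default when none is declared.
function encoding_from_xml_declaration($xml, $default = 'UTF-8')
{
    // Only look at a declaration sitting at the very start of the document.
    if (preg_match('/^<\?xml\s[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']/', $xml, $m)) {
        return strtoupper($m[1]);
    }
    return $default;
}
```

Note that this deliberately ignores the HTTP headers, which is exactly what RFC 3023 says not to do for text/xml.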

> 
> Still, the number of factors affecting encoding and transmission are  
> unbelievably complex.
> ...
> and...ugh! Sometimes I just want to kill myself.

Yup, I only had the chance to test myself on an Arabic website once. It was great fun, and I learned a lot from it, but it never went online (and the translator refused to translate single phrases as I had specced, to be put in the translation engine db, but insisted on giving me back the 5-page translation document without any hint of where the paragraphs were separated...)

> 
> While I suppose that attempting to convert all data into us-ascii  
> through entity encoding gives us the "least common denominator"  
> solution -- make everything 7-bit! -- it obviously isn't working  
> perfectly.

This is btw a 'road accident', not a by-design feature, and the previous situation was wrong anyway.
The general solution (i.e. let the lib encode any internal charset to ASCII) is a bit daunting to code in PHP, but adding the 80% case (i.e. UTF-8 to ASCII) should be quite easy. AND we would be following the spec.
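
A rough sketch of that 80% case, assuming the input is well-formed UTF-8. The function name is invented for the example and this is not the lib's actual xmlrpc_encode_entities() code. It needs PHP 5.3+ for the closure.

```php
<?php
// Hypothetical helper, not part of the phpxmlrpc API.
// Turns a UTF-8 string into pure 7-bit ASCII by replacing every
// multi-byte character with a numeric character reference.
function utf8_to_ascii_entities($utf8)
{
    // Escape XML-special characters first, so that the '&' of the
    // entities generated below is not escaped a second time.
    $escaped = str_replace(
        array('&', '<', '>', '"'),
        array('&amp;', '&lt;', '&gt;', '&quot;'),
        $utf8
    );
    // Match 2-, 3- and 4-byte UTF-8 sequences (byte-wise, no /u flag).
    // Bytes that do not form such a sequence are left untouched.
    return preg_replace_callback(
        '/[\xC2-\xDF][\x80-\xBF]|[\xE0-\xEF][\x80-\xBF]{2}|[\xF0-\xF4][\x80-\xBF]{3}/',
        function ($m) {
            $bytes = array_map('ord', str_split($m[0]));
            $n = count($bytes);
            // Drop the length-marker bits of the lead byte.
            $code = $bytes[0] & (0xFF >> ($n + 1));
            // Append the 6 payload bits of each continuation byte.
            for ($i = 1; $i < $n; $i++) {
                $code = ($code << 6) | ($bytes[$i] & 0x3F);
            }
            return '&#' . $code . ';';
        },
        $escaped
    );
}
```

The output contains only 7-bit characters, so it stays valid when no charset is declared in the HTTP headers, at the cost of a larger payload for non-Latin text.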

> So perhaps any solution that simply makes it work,  
> regardless of whether or not it changes the use of  
> $xmlrpc_internalencoding, would be good. I did wonder about the  
> utf8_encode() function, and why you didn't simply use that instead of  
> $character = ("&#".strval($code).";"); Won't that do all the right  
> work for you?

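Yes, provided that we also declared UTF-8 in the HTTP headers.
No, in the current situation: utf8_encode() produces 8-bit bytes, and without a declared charset the payload has to be US-ASCII.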

> 
> In any case, I think you should try to make the XMLRPC library follow  
> as closely as possible the relevant spec/RFC "recommended" behavior,  
> and let that be your guide.
> ...

What I am currently thinking about is something along the lines:

1 - add support for xmlrpc_internalencoding in xmlrpc_encode_entities(), ONLY for utf-8 to ascii, ascii-to-ascii and iso-8859-1 to ascii

2 - add support for specific charset encodings into xmlrpcmsg. If left unspecified, it defaults to US-ASCII, as per the current behaviour. When specified, it will modify the HTTP Content-Type header, and potentially save a lot of time by NOT encoding special chars into XML entities

3 - figure out whether the choice of the response charset encoding should be left to the response object or to the server. Hint: the server can make intelligent decisions based on the client's HTTP headers (Accept-Charset).
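
To make points 2 and 3 a bit more concrete, here is a sketch of how a server might pick the response charset from the client's Accept-Charset header and build the matching Content-Type header. Both function names and the list of supported charsets are assumptions made up for the example; nothing like this exists in the lib yet. The parsing is simplified (the "*" wildcard is ignored).

```php
<?php
// Hypothetical helpers; the names are illustrative, not phpxmlrpc API.

// Pick the supported charset with the highest q-value in Accept-Charset.
function pick_response_charset($acceptCharset, $supported = array('UTF-8', 'ISO-8859-1', 'US-ASCII'))
{
    $best = 'US-ASCII'; // fallback, matching the current default behaviour
    $bestQ = 0.0;
    foreach (explode(',', $acceptCharset) as $item) {
        $parts = explode(';', $item);
        $name = strtoupper(trim($parts[0]));
        $q = 1.0; // quality defaults to 1 when no q= parameter is given
        foreach (array_slice($parts, 1) as $param) {
            if (preg_match('/^\s*q\s*=\s*([0-9.]+)\s*$/i', $param, $m)) {
                $q = (float) $m[1];
            }
        }
        // Strict comparison: on equal q-values the first listed charset wins,
        // and anything sent with q=0 is never selected.
        if ($q > $bestQ && in_array($name, $supported, true)) {
            $best = $name;
            $bestQ = $q;
        }
    }
    return $best;
}

// Build the HTTP header that declares the chosen charset.
function content_type_header($charset)
{
    return 'Content-Type: text/xml; charset=' . $charset;
}
```

With a header such as 'iso-8859-1;q=0.5, utf-8;q=0.9' the server would answer in UTF-8 and could skip the entity encoding entirely.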


Bye
Gaetano

