PHP - DOM class and encoding issues

I'm having a hard time with the PHP DOM class.

I am creating a sitemap script and I need the output of $doc->saveXML() to look like

<?xml version="1.0" encoding="UTF-8"?>
<root>
    <url>
        <loc>http://www.somesite.com/servi&#xE7;os/redesign</loc>
    </url>
</root>

      

or

<?xml version="1.0" encoding="UTF-8"?>
<root>
    <url>
        <loc>http://www.somesite.com/servi&#231;os/redesign</loc>
    </url>
</root>

      

but I get:

<?xml version="1.0" encoding="UTF-8"?>
<root>
    <url>
        <loc>http://www.somesite.com/servi&amp;#xE7;os/redesign</loc>
    </url>
</root>

      

This is the closest I could get, using a function that replaces named entities with numeric ones.

I was also able to produce

<?xml version="1.0" ?>
<root>
    <url>
        <loc>http://www.somesite.com/servi&amp;#xE7;os/redesign</loc>
    </url>
</root>

      

But without the specified encoding.

Ideal solution (how it seems to me the code should be written):

<?php
$myArray = array();
// do some stuff to populate the array with URL strings

$doc = new DOMDocument('1.0', 'UTF-8');

// here we modify some property. Maybe this is the answer I am looking for...

$urlset = $doc->createElement("urlset");
$urlset = $doc->appendChild($urlset);

foreach($myArray as $address) {
    $url = $doc->createElement("url");
    $url = $urlset->appendChild($url);

    $loc = $doc->createElement("loc");
    $loc = $url->appendChild($loc);

    // wrap the URL string in a text node, then attach that node to <loc>
    $valueContent = $doc->createTextNode($address);
    $valueContent = $loc->appendChild($valueContent);
}

echo $doc->saveXML();
?>

      

Notes:

  • The server response header contains charset as UTF-8;
  • PHP script is saved in UTF-8;
  • The URLs being read are UTF-8 strings;
  • The above script contains the encoding declaration in the DOMDocument constructor and does not use any conversion functions like htmlentities, urlencode, utf8_encode ...

I tried changing the DOMDocument properties DOMDocument::$resolveExternals and DOMDocument::$substituteEntities. No combination worked.

And yes, I know I can do the whole process without specifying the character set in the DOMDocument constructor, dump the content to a string variable, and do a very simple replacement using string replacement functions. It works. But I would like to know where I am slipping up, how it can be done using the native API and settings, or even whether it is possible at all.

Thanks in advance.

2 answers


resolveExternals and substituteEntities are parser options. They do not affect serialization.

The XML Infoset makes no distinction between:

<loc>http://www.somesite.com/serviços/redesign</loc>
<loc>http://www.somesite.com/servi&#xE7;os/redesign</loc>
<loc>http://www.somesite.com/servi&#231;os/redesign</loc>

      

They all represent exactly the same information, any XML parser must treat them as identical, and XML serializers usually don't let you choose which form to output. Usually you just set the text node's value to ç and let the serializer write it out as the raw UTF-8 byte sequence in the output file.
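As a minimal sketch of that default behaviour (assuming the DOM extension is loaded; the escape "\xC3\xA7" stands in for ç so the example does not depend on how the script file itself is saved), a document declared as UTF-8 serializes the character as raw bytes, not as a numeric reference:

```php
<?php
// Document declared as UTF-8: the serializer can write ç directly.
$doc = new DOMDocument('1.0', 'UTF-8');

$loc = $doc->createElement('loc');
$doc->appendChild($loc);

// "\xC3\xA7" is the UTF-8 byte sequence for ç.
$loc->appendChild($doc->createTextNode("servi\xC3\xA7os"));

$xml = $doc->saveXML();

// The output contains the raw bytes and no "&#..." reference.
$hasRawBytes   = strpos($xml, "servi\xC3\xA7os") !== false;
$hasNumericRef = strpos($xml, '&#') !== false;
```

Here $hasRawBytes and $hasNumericRef are illustrative names only; they make the two observations explicit.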

If you really have to generate an XML file that contains only ASCII, so that you cannot directly use characters like ç, then tell PHP to use ASCII as the encoding of the document:

$s= "serviços"; // or "\xC3\xA7" if you can't input UTF-8 strings directly

$doc = new DOMDocument('1.0', 'US-ASCII');
$doc->appendChild($loc= $doc->createElement('loc'));
$loc->appendChild($doc->createTextNode($s));
echo $doc->saveXML();

      



result:

<?xml version="1.0" encoding="US-ASCII"?>
<loc>servi&#231;os</loc>

      

However, having said all this, I still don't think this is correct. Your value appears to be a URL, and non-ASCII characters are invalid in URLs, no matter how they are encoded in the containing XML. It should be:

http://www.somesite.com/servi%C3%A7os/redesign

      

as returned by rawurlencode('serviços').
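A possible way to apply this when building the sitemap, sketched under the assumption that each path segment is available as a separate UTF-8 string ($segments, $path and $url are hypothetical names, not part of the original script): percent-encode each segment on its own so the slashes between them are preserved, then hand the resulting ASCII-only URL to the DOM.

```php
<?php
// Hypothetical path segments; "\xC3\xA7" is the UTF-8 byte sequence for ç.
$segments = ["servi\xC3\xA7os", 'redesign'];

// Encode each segment separately so the "/" separators stay intact.
$path = implode('/', array_map('rawurlencode', $segments));
$url  = 'http://www.somesite.com/' . $path;

$doc = new DOMDocument('1.0', 'UTF-8');
$loc = $doc->createElement('loc');
$doc->appendChild($loc);
$loc->appendChild($doc->createTextNode($url));

$xml = $doc->saveXML();
```

Because the URL is pure ASCII after encoding, the choice of document encoding no longer affects how the loc value is written.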


Decode your entities before passing them to createTextNode:

$valueContent = $doc->createTextNode(html_entity_decode($value, ENT_QUOTES, 'UTF-8'));

      



This is because createTextNode() treats its argument as literal text, so &#231; is not interpreted as an entity. DOMDocument sees the & and encodes it as &amp;.
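The difference can be checked with a small sketch (assuming the DOM extension is loaded; the helper serializeLoc() is a hypothetical name introduced only for this illustration):

```php
<?php
// Hypothetical helper: put $text into a <loc> text node and serialize.
function serializeLoc(string $text): string
{
    $doc = new DOMDocument('1.0', 'UTF-8');
    $loc = $doc->createElement('loc');
    $doc->appendChild($loc);
    $loc->appendChild($doc->createTextNode($text));
    return $doc->saveXML();
}

$value = 'servi&#231;os';

// Passed as-is, the ampersand is escaped on output.
$escaped = serializeLoc($value);

// Decoded first, the text node holds the real character ç.
$decoded = serializeLoc(html_entity_decode($value, ENT_QUOTES, 'UTF-8'));
```

In the first result the loc content reads servi&amp;#231;os; in the second it contains the UTF-8 bytes for ç.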
