Aug 29

UTF-8 Encoding problems with file_get_contents() and DOMDocument

I recently bumped into an encoding issue on a project I was working on.
I was trying to scrape content from a website served as ISO-8859-1, and I needed to capture some of its text and store it in a database as UTF-8.

After some trial and error I found a way to convert the text to the right encoding before saving it to the database.

A simplified version of what I did:

 $url = 'http://www.smooka.com/blog/';
 $html = file_get_contents($url);
 
 // Convert from ISO-8859-1 to UTF-8 (iconv() takes the source charset first)
 $html = iconv('ISO-8859-1', 'UTF-8//TRANSLIT', $html);
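
The direction of the conversion is easy to get backwards, so here is a minimal sketch that checks it on a hard-coded byte string instead of a live page. The sample string and variable names are assumptions for illustration and are not part of the original scraper.

```php
<?php
// "café" encoded as ISO-8859-1: the é is the single byte 0xE9.
$latin1 = "caf\xE9";

// iconv(from, to, string): the source charset comes first.
$utf8 = iconv('ISO-8859-1', 'UTF-8', $latin1);

// In UTF-8 the é becomes the two bytes 0xC3 0xA9.
var_dump(bin2hex($utf8));   // string(10) "636166c3a9"
var_dump(strlen($latin1));  // int(4)
var_dump(strlen($utf8));    // int(5)
```

If the hex dump shows a single e9 byte instead of c3a9, the string is still ISO-8859-1 and the arguments are probably swapped.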


This partially solved my problem, but for some reason single quotes were displayed as question marks. After reading documentation and forums, I discovered that I needed to transliterate the extracted text to ASCII.

 $dom = new Zend_Dom_Query($html);
 $results = $dom->query('div p');

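 foreach ($results as $result)
 {
    // nodeValue comes back from the DOM extension as UTF-8
    $line = $result->nodeValue;
    // Transliterate characters outside ASCII to their closest ASCII equivalent
    $line = iconv('UTF-8', 'ASCII//TRANSLIT', $line);
 }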
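
One possible cause of the question marks is that the page declares ISO-8859-1 but actually contains Windows-1252 "smart quotes" (bytes 0x91 to 0x94). That is an assumption about the scraped site, since the raw bytes are not shown here. Under that assumption, the sketch below compares the two source charsets on a made-up sample string, then maps the typographic quotes to ASCII with an explicit table rather than //TRANSLIT. It needs PHP 7 or later for the "\u{...}" escapes.

```php
<?php
// Hypothetical sample: "it's" with a Windows-1252 right single quote (byte 0x92).
$raw = "it\x92s";

// Read as ISO-8859-1, byte 0x92 becomes the control character U+0092.
$asLatin1 = iconv('ISO-8859-1', 'UTF-8', $raw);

// Read as Windows-1252, byte 0x92 becomes U+2019, the typographic apostrophe.
$asCp1252 = iconv('Windows-1252', 'UTF-8', $raw);

var_dump(bin2hex($asLatin1)); // string(10) "6974c29273"
var_dump(bin2hex($asCp1252)); // string(12) "6974e2809973"

// If plain ASCII is still required, an explicit map avoids relying on
// //TRANSLIT, whose output depends on the iconv implementation and locale.
$ascii = strtr($asCp1252, ["\u{2018}" => "'", "\u{2019}" => "'"]);
var_dump($ascii); // string(4) "it's"
```

Keeping the text as UTF-8 and replacing only the characters you care about preserves accented letters that a blanket ASCII transliteration would flatten.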

I hope this saves you some time. I was not able to find a complete solution on the web, so I had to piece everything together.

2 Comments to “UTF-8 Encoding problems with file_get_contents() and DOMDocument”

  • Thank you very much, man! You saved me some time.
