Aug 29
UTF-8 Encoding problems with file_get_contents() and DOMDocument
I recently bumped into an encoding issue on a project I was working on.
I was trying to scrape some content off a website that had ISO-8859-1 charset encoding, and I needed to capture some text and store it in a database as UTF-8.
After some trial and error I discovered a way to properly change the encoding before saving it in the DB.
A simplified version of what I did:
$url = 'http://www.smooka.com/blog/';
$html = file_get_contents($url);
//Change encoding to UTF-8 from ISO-8859-1 (iconv takes the source encoding first, then the target)
$html = iconv('ISO-8859-1', 'UTF-8//TRANSLIT', $html);
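Since iconv() expects the source encoding first and the target encoding second, the direction is easy to check on a known input. Here is a minimal sketch with a hard-coded sample string (my own assumption for illustration, not data from the scraped site):

```php
<?php
// Hypothetical sample: "café" encoded as ISO-8859-1 (é is the single byte 0xE9).
$latin1 = "caf\xE9";

// iconv(from, to, string): source encoding first, target encoding second.
$utf8 = iconv('ISO-8859-1', 'UTF-8', $latin1);

// In UTF-8 the é becomes the two bytes 0xC3 0xA9.
echo bin2hex($utf8), "\n"; // prints 636166c3a9

// preg_match() with the /u flag only succeeds on valid UTF-8.
$isValidBefore = (preg_match('//u', $latin1) === 1); // false: a lone 0xE9 is not valid UTF-8
$isValidAfter  = (preg_match('//u', $utf8) === 1);   // true
```

If the arguments are swapped, iconv() treats the ISO-8859-1 bytes as UTF-8 input and typically stops at the first accented character.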
This partially solved my problem, but for some reason single quotes were displayed as question marks. After reading documentation and forum threads, I discovered that I also needed to transliterate the extracted text to ASCII:
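$dom = new Zend_Dom_Query($html);
//Select every <p> inside a <div>
$results = $dom->query('div p');
foreach ($results as $result)
{
    //nodeValue comes back as UTF-8
    $line = $result->nodeValue;
    //Replace characters that have no ASCII equivalent with the closest match
    $line = iconv('UTF-8', 'ASCII//TRANSLIT', $line);
    //$line is now ready to be saved
}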
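For anyone not using Zend Framework, the same flow can be sketched with PHP's built-in DOMDocument. This is only a rough outline under a few assumptions: the function name extractParagraphs() and the sample markup are made up for illustration, the dom and mbstring extensions are installed, and the text is kept as UTF-8 instead of being transliterated to ASCII. Non-ASCII characters are turned into numeric entities before parsing so that loadHTML() does not have to guess the encoding.

```php
<?php
// Hypothetical helper (not from the original scraper): returns the text of
// every <p> inside a <div> as UTF-8 strings.
function extractParagraphs(string $html, string $sourceCharset = 'ISO-8859-1'): array
{
    // Source encoding first, target encoding second.
    $utf8 = iconv($sourceCharset, 'UTF-8', $html);
    if ($utf8 === false) {
        return []; // conversion failed, e.g. unknown charset name
    }

    // Turn every non-ASCII character into a numeric entity so the HTML
    // parser receives plain ASCII and cannot misread the encoding.
    $ascii = mb_encode_numericentity($utf8, [0x80, 0x10FFFF, 0, 0x1FFFFF], 'UTF-8');

    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // collect warnings from sloppy markup quietly
    $dom->loadHTML($ascii);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $lines = [];
    $xpath = new DOMXPath($dom);
    // '//div//p' is the XPath counterpart of the CSS selector 'div p'.
    foreach ($xpath->query('//div//p') as $node) {
        $lines[] = trim($node->nodeValue); // the DOM extension returns UTF-8
    }
    return $lines;
}

// Sample ISO-8859-1 input: the byte 0xE9 is "é".
$sample = "<html><body><div><p>Caf\xE9 au lait</p></div></body></html>";
$lines = extractParagraphs($sample);
print_r($lines);
```

If plain ASCII is required, each returned line can still be passed through iconv('UTF-8', 'ASCII//TRANSLIT', $line) as in the loop above.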
Hope this saves you some time. I was not able to find a complete solution on the web, so I had to piece everything together.