Decode HTML-encoded strings in Python

Question

I have the following line ...

"Scam, hoax, or the real deal, he&#8217;s gonna work his way to the bottom of the sordid tale, and hopefully end up with an arcade game in the process."

I need to turn it into this string ...

"Scam, hoax, or the real deal, he's gonna work his way to the bottom of the sordid tale, and hopefully end up with an arcade game in the process."

This is pretty standard HTML encoding, and I can't for the life of me figure out how to decode it in Python.

I found this: GitHub

It comes very close to working, but it does not output an apostrophe; it outputs some other Unicode character instead.

Here's an example of the output from the GitHub script ...

"Scam, hoax, or the real deal, heâs gonna work his way to the bottom of the sordid tale, and hopefully end up with an arcade game in the process."

python html xml

Lounges May 27 '09 at 4:32

1 answer

las3rjock · Accepted Answer · 2009-05-27T04:49:52+0000

What you are trying to do is called "HTML entity decoding", and it has been covered in a number of past Stack Overflow questions.

Here's a piece of code that uses the Beautiful Soup HTML parser to decode your example:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Written for Python 2 with BeautifulSoup 3 (the "BeautifulSoup" package).
from BeautifulSoup import BeautifulSoup

string = "Scam, hoax, or the real deal, he&#8217;s gonna work his way to the bottom of the sordid tale, and hopefully end up with an arcade game in the process."
# convertEntities tells the parser to turn HTML entities into Unicode characters.
s = BeautifulSoup(string, convertEntities=BeautifulSoup.HTML_ENTITIES).contents[0]
print s

Here's the output:

Scam, hoax, or the real deal, he’s gonna work his way to the bottom of the sordid tale, and hopefully end up with an arcade game in the process.
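
For readers on Python 3, where the Beautiful Soup 3 call above is not available, the same decoding can probably be done with the standard library alone. The following is a minimal sketch, assuming Python 3.4 or later; the variable names are only illustrative. Note that &#8217; is the right single quotation mark (U+2019) rather than the ASCII apostrophe, so the last step, which swaps it for a plain ', is optional and only needed if ASCII output is wanted.

```python
import html

encoded = ("Scam, hoax, or the real deal, he&#8217;s gonna work his way to "
           "the bottom of the sordid tale, and hopefully end up with an "
           "arcade game in the process.")

# html.unescape converts numeric (&#8217;) and named (&rsquo;) character
# references into the corresponding Unicode characters.
decoded = html.unescape(encoded)

# &#8217; decodes to U+2019 RIGHT SINGLE QUOTATION MARK, not to "'".
# Replace it explicitly if a plain ASCII apostrophe is required.
ascii_apostrophe = decoded.replace("\u2019", "'")

print(decoded)
print(ascii_apostrophe)
```

If the decoded text still shows up as something like "heâs", the decoding itself is likely fine and the character is being written or displayed with a mismatched encoding (for example, UTF-8 bytes interpreted as Latin-1).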
