Decode HTML-encoded strings in Python
I have the following string ...
"Scam, hoax, or the real deal, he&#8217;s gonna work his way to the bottom of the sordid tale, and hopefully end up with an arcade game in the process."
I need to turn it into this string ...
Scam, hoax, or the real deal, he's gonna work his way to the bottom of the sordid tale, and hopefully end up with an arcade game in the process.
This is pretty standard HTML encoding, and I can't for the life of me figure out how to convert it in Python.
I found this: GitHub
It comes very close to working; however, it does not output an apostrophe but some odd Unicode character instead.
Here's an example of the output from the GitHub script ...
Scam, hoax, or the real deal, heâs gonna work his way to the bottom of the sordid tale, and hopefully end up with an arcade game in the process.
What you are trying to do is called "HTML entity decoding", and it has been covered in a number of past Stack Overflow questions.
Here's a piece of code using Beautiful Soup HTML parsing to decode your example:
#!/usr/bin/env python3
# Requires the beautifulsoup4 package (pip install beautifulsoup4).
from bs4 import BeautifulSoup

string = "Scam, hoax, or the real deal, he&#8217;s gonna work his way to the bottom of the sordid tale, and hopefully end up with an arcade game in the process."
# bs4 converts HTML entities to Unicode characters while parsing, so the
# convertEntities argument from BeautifulSoup 3 is no longer needed.
s = BeautifulSoup(string, "html.parser").get_text()
print(s)
Here's the output:
Scam, hoax, or the real deal, he’s gonna work his way to the bottom of the sordid tale, and hopefully end up with an arcade game in the process.
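If you are on Python 3.4 or later, the standard library may be all you need. The following is a minimal sketch, assuming the raw input carries the apostrophe as the numeric entity &#8217; (that entity form and the shortened sample sentence are assumptions for illustration). It uses html.unescape to decode the entity, optionally swaps the resulting curly quote for a plain apostrophe, and then shows one likely explanation for the "heâs" output: the UTF-8 bytes of the decoded character being read back as Latin-1.

```python
import html

# Shortened sample; the &#8217; entity is assumed to be what the raw input contains.
raw = "Scam, hoax, or the real deal, he&#8217;s gonna work his way to the bottom of the sordid tale."

# html.unescape turns named and numeric HTML entities into Unicode characters.
decoded = html.unescape(raw)
print(decoded)

# &#8217; is U+2019 (right single quotation mark), not an ASCII apostrophe.
# If a plain apostrophe is wanted, replace it explicitly.
ascii_quote = decoded.replace("\u2019", "'")
print(ascii_quote)

# Encoding U+2019 as UTF-8 yields three bytes (E2 80 99). Reading those bytes
# back as Latin-1 gives "â" plus two invisible control characters, which would
# display much like the "heâs" in the question.
garbled = decoded.encode("utf-8").decode("latin-1")
print(repr(garbled))
```

If the garbled form shows up in your own output, it is worth checking which encoding your terminal or output file expects before changing the decoding step itself.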