How does this site fix the encoding?
I am trying to convert this text:
×וויר. ×"עתי×" של רשתות חברתיות ו×"תקשורת ×©×œ× ×•
Into this text:
אוויר. העתיד של רשתות חברתיות והתקשורת שלנו
Somehow, this site:
http://www.pixiesoft.com/flip/
can do this, and I would like to know how I can do it myself (with any programming language or software).
Just saving the file as UTF-8 won't do it.
My motivation for this question is that I have a malformed XML export file that I want to turn into a correct Hebrew text file.
The export was originally mangled by a MySQL import and export, but I don't have the information I need to fix it or to track down the problem at its source.
Thanks.
Since the problem was caused by MySQL double-encoding UTF-8 strings, MySQL is the right tool to fix it.
Executing the following commands will resolve the issue:
mysqldump $DB_NAME -u $DB_USER -p -h $DB_HOST.EXAMPLE.NET --add-drop-table --default-character-set=latin1 > export.sql
- latin1 is used here only to keep MySQL from altering the characters during the export; it should not be used otherwise.
cp export{,.utf8}.sql
- Keeps export.sql as a backup and creates export.utf8.sql as a working copy.
sed -i -e 's/latin1/utf8/g' export.utf8.sql
- Replaces latin1 with utf8 in the working copy so that it is imported as UTF-8 instead of ISO 8859-1.
mysql $DB_NAME -u $DB_USER -p -h $DB_HOST.EXAMPLE.NET < export.utf8.sql
- Imports everything back into the database.
This will fix the problem in about ten minutes.
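To picture what "double-encoded" means here, the following Python sketch (not part of the original answer) encodes a Hebrew letter as UTF-8, misreads those bytes as latin1, and encodes the result as UTF-8 a second time. The function names are hypothetical, and Python's `latin-1` codec is only assumed to be a close enough stand-in for MySQL's latin1.

```python
def double_encode(text: str) -> bytes:
    """Simulate the damage: UTF-8 bytes misread as latin-1, then stored as UTF-8 again."""
    utf8_bytes = text.encode("utf-8")
    misread = utf8_bytes.decode("latin-1")  # every byte becomes one character
    return misread.encode("utf-8")          # each of those characters is encoded again


def undo_double_encode(data: bytes) -> str:
    """Reverse the steps above: one UTF-8 decode, back to bytes, then the real decode."""
    return data.decode("utf-8").encode("latin-1").decode("utf-8")


letter = "\u05e9"  # the Hebrew letter shin, two bytes in UTF-8 (D7 A9)
damaged = double_encode(letter)
print(damaged.hex())                          # c397c2a9
print(undo_double_encode(damaged) == letter)  # True
```

The two-byte letter grows to four bytes, which is roughly the state that the dump-as-latin1, import-as-utf8 round trip is meant to undo.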
If you look closely at the gibberish, you can tell that every Hebrew character is encoded as 2 characters: it appears that של is encoded as ×©×œ.
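This suggests that you are looking at UTF-8 (or UTF-16) bytes displayed as if they were a single-byte encoding such as ASCII. Converting the file to UTF-8 will not help, because the text is already being treated as single-byte characters and the conversion will simply preserve the gibberish.
Instead, you can read each pair of characters as two bytes and recover the original UTF-8 from them.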
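As a rough illustration of how such gibberish arises, this Python sketch encodes a two-letter Hebrew word as UTF-8 and then decodes the bytes with a single-byte codec. Choosing Windows-1252 (`cp1252`) as that codec is an assumption made for the example; the post itself does not say which encoding was involved.

```python
original = "\u05e9\u05dc"                # the two-letter Hebrew word from the example
utf8_bytes = original.encode("utf-8")    # two bytes per Hebrew letter
gibberish = utf8_bytes.decode("cp1252")  # each byte shown as its own character

print(len(original), len(utf8_bytes), len(gibberish))  # 2 4 4
print(ascii(gibberish))
```

Each Hebrew letter turns into a pair of Latin characters, which matches the pattern visible in the garbled text.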
This is the C# I came up with. It's very simplistic (it doesn't completely work, as it makes too many assumptions), but I could see that some of the characters were converted correctly:
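private string ToProperHebrew(string gibberish)
{
    // UTF-16LE bytes of the string: two bytes per character, low byte first.
    byte[] orig = Encoding.Unicode.GetBytes(gibberish);
    byte[] heb = new byte[orig.Length / 2];
    for (int i = 0; i < orig.Length / 2; i++)
    {
        // Keep only the low byte of each character and discard the high byte.
        heb[i] = orig[i * 2];
    }
    // Reinterpret the collected bytes as UTF-8.
    return Encoding.UTF8.GetString(heb);
}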
It appears that each byte was re-encoded as two bytes. I'm not sure which encoding was used for that, but discarding one byte of each pair seemed to be correct for most of the doubled characters.
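A possible reason why only some characters come out right is sketched below in Python, under the assumption that the gibberish is UTF-8 shown as Windows-1252. Keeping the low byte works for characters below U+0100, but Windows-1252 maps some bytes to higher code points (for example 0x9C to U+0153), and those lose their original byte. Both function names are hypothetical.

```python
def low_byte_fix(gibberish: str) -> str:
    """Mimic the approach above: keep the low byte of each UTF-16 code unit."""
    utf16 = gibberish.encode("utf-16-le")
    low_bytes = utf16[0::2]
    return low_bytes.decode("utf-8", errors="replace")


def cp1252_fix(gibberish: str) -> str:
    """Map each character back to its Windows-1252 byte, then decode as UTF-8."""
    return gibberish.encode("cp1252").decode("utf-8")


expected = "\u05e9\u05dc"               # the two-letter Hebrew word from the example
gibberish = "\u00d7\u00a9\u00d7\u0153"  # its garbled form; the last character is U+0153

print(low_byte_fix(gibberish) == expected)  # False: U+0153 yields byte 0x53, not 0x9C
print(cp1252_fix(gibberish) == expected)    # True
```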
Based on the answers from Oded and Teddy I came up with this method that worked for me:
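import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public String getProperHebrew(String gibberish) {
    // Turn the gibberish back into the raw bytes it was decoded from.
    byte[] orig = gibberish.getBytes(Charset.forName("windows-1252"));
    // Decode those bytes as the UTF-8 they really are.
    return new String(orig, StandardCharsets.UTF_8);
}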
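For comparison, here is a Python sketch of the same idea; it is not from the original answers, and `fix_mojibake` is a hypothetical name. Python's `cp1252` codec rejects the five byte values that Windows-1252 leaves undefined (0x81, 0x8D, 0x8F, 0x90, 0x9D), and two of them occur as the second UTF-8 byte of Hebrew letters (א is D7 90, ם is D7 9D). The sketch therefore falls back to `latin-1` for those characters, assuming they reached the text as C1 control characters.

```python
def fix_mojibake(gibberish: str) -> str:
    """Re-encode character by character to Windows-1252 bytes, then decode as UTF-8."""
    raw = bytearray()
    for ch in gibberish:
        try:
            raw += ch.encode("cp1252")
        except UnicodeEncodeError:
            # Assumption: a character cp1252 cannot encode is a C1 control
            # (U+0080..U+009F) that stands for the byte of the same value.
            # Anything above U+00FF still raises here, which is intended.
            raw += ch.encode("latin-1")
    return raw.decode("utf-8")


print(fix_mojibake("\u00d7\u00a9\u00d7\u0153") == "\u05e9\u05dc")  # True
print(fix_mojibake("\u00d7\u0090") == "\u05d0")                    # True (the letter alef)
```

To repair a whole file, the same function could be applied to its contents after reading them as text, provided the file really holds the garbled characters rather than the raw bytes.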