How does this site fix the encoding?

Question

I am trying to convert this text:

××•×•×™×¨. ×"×¢×ª×™×" ×©×œ ×¨×©×ª×•×ª ×—×‘×¨×ª×™×•×ª ×•×"×ª×§×©×•×¨×ª ×©×œ× ×•

Into this text:

אוויר. העתיד של רשתות חברתיות והתקשורת שלנו

Somehow, this site can do it:

http://www.pixiesoft.com/flip/

I would like to know how I can do this myself (with any programming language or software).

Just saving the file as UTF-8 doesn't do it.

My motivation for this question is that I have an exported XML file with malformed text that I want to turn into a correct Hebrew text file.

The XML export was originally malformed by the MySQL import and export, but I don't have the information I need to fix it or track down the problem.

Thanks.

+1

mysql encoding utf-8 character-encoding hebrew

Tal galili May 15 '10 at 12:03

6 answers

If you look closely at the gibberish, you can tell that every Hebrew character is encoded as 2 characters: it appears that של is encoded as ×©×œ.

This suggests that you are looking at UTF8 or UTF16 text read as a single-byte encoding (ASCII or one of its 8-bit extensions). Converting to UTF8 will not help, as the text has already been decoded that way and will retain that encoding.

You can read each byte pair and recover the original UTF8 from them.

This is the C# I came up with - it's very simplistic (it doesn't completely work; too many assumptions), but I could see that some of the characters were converted correctly:

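// Requires: using System.Text;
// Keeps only the low byte of each UTF-16 code unit and decodes the result as UTF-8.
// This is only right for characters below U+0100; a character such as œ (U+0153)
// loses its high byte, which is why some letters still come out wrong.
private string ToProperHebrew(string gibberish)
{
   byte[] orig = Encoding.Unicode.GetBytes(gibberish); // UTF-16LE, 2 bytes per code unit
   byte[] heb = new byte[orig.Length / 2];

   for (int i = 0; i < orig.Length / 2; i++)
   {
     heb[i] = orig[i * 2]; // low byte of each code unit
   }

   return Encoding.UTF8.GetString(heb);
}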

It appears that each byte was re-encoded as two bytes. I am not sure which encoding was used for that, but discarding one byte seemed to be correct for most of the doubled characters.
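To see why discarding every second byte is right for only some characters, here is a minimal Python sketch. It assumes the gibberish came from reading UTF-8 bytes as Windows-1252, which this answer does not confirm; the variable names are illustrative only. It compares the low byte of each UTF-16 code unit with the byte a Windows-1252 encoder would produce for the same character.

```python
# Mojibake for the Hebrew word "של" (× © × œ), written with escapes so the
# snippet does not depend on how this page itself is encoded.
gibberish = "\u00d7\u00a9\u00d7\u0153"

# What the C# code does: keep the low byte of every UTF-16LE code unit.
low_bytes = gibberish.encode("utf-16-le")[0::2]

# Assumed source encoding: map each character back to its Windows-1252 byte.
cp1252_bytes = gibberish.encode("windows-1252")

print(low_bytes.hex())     # d7a9d753
print(cp1252_bytes.hex())  # d7a9d79c

# The first three bytes agree because ×, © sit below U+0100.
# œ is U+0153, so its low byte is 0x53 instead of the Windows-1252 byte 0x9C.
good = cp1252_bytes.decode("utf-8")           # both letters recovered
bad = low_bytes.decode("utf-8", "replace")    # second letter is lost
print(good == "\u05e9\u05dc")  # True
print(bad == "\u05e9\ufffdS")  # True
```

Under that assumption, only characters whose Windows-1252 byte differs from their Unicode code point (such as œ, •, ™) are affected, which would explain why "most" of the doubled characters came out right.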

+2

Oded May 15 '10 at 12:14

You might want to look here - the accepted answer to this question shows a way to guess the encoding of a byte[]. All you need to do is get the correct bytes out of the gibberish. Of course, guessing can always fail ...
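One simple way to do the guessing, sketched in Python, is to try a few candidate single-byte encodings and keep the first one whose bytes decode cleanly as UTF-8. This is not the method from the linked answer (which is not reproduced here); `guess_repair` and its candidate list are hypothetical names chosen for illustration.

```python
def guess_repair(gibberish, candidates=("latin-1", "windows-1252")):
    """Return (encoding, repaired_text) for the first candidate that works, else None."""
    for enc in candidates:
        try:
            raw = gibberish.encode(enc)      # recover the bytes under this guess
            return enc, raw.decode("utf-8")  # strict decode: a wrong guess usually fails
        except (UnicodeEncodeError, UnicodeDecodeError):
            continue
    return None


# × © × œ : latin-1 cannot encode œ, so the second candidate is the one that fits.
result = guess_repair("\u00d7\u00a9\u00d7\u0153")
print(result is not None and result[0])  # windows-1252
```

As the answer says, guessing can fail: plain ASCII text passes every candidate unchanged, and gibberish that has lost bytes may pass none.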

+2

the.duckman May 15 '10 at 12:24

You can use a meta tag to set the correct encoding for your page. Here's an example of how you can do it:

I assume this encoding will do the job.

+1

Thea May 15 '10 at 12:08

Based on the answers from Oded and Teddy I came up with this method that worked for me:

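// Requires: import java.nio.charset.Charset;
//           import java.nio.charset.StandardCharsets;
public String getProperHebrew(String gibberish) {
    // Recover the original bytes by encoding the gibberish as windows-1252 ...
    byte[] orig = gibberish.getBytes(Charset.forName("windows-1252"));

    // ... then decode those bytes as the UTF-8 they really were.
    // Using the Charset constant avoids the checked UnsupportedEncodingException.
    return new String(orig, StandardCharsets.UTF_8);
}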

+1

Adrian Dec 16 '11 at 15:49

gibberish.encode('windows-1252').decode('utf-8', 'replace')
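A small worked example of that one-liner follows, as a sketch rather than a definitive implementation. The helper name `fix_mojibake` is made up for illustration, and the "damaged" input assumes that a byte with no Windows-1252 character (0x90, the second byte of א) was dropped somewhere along the way, as seems to have happened at the start of the sample in the question.

```python
def fix_mojibake(gibberish, errors="replace"):
    # Encode back to the bytes the text was wrongly decoded from,
    # then decode those bytes as UTF-8.
    return gibberish.encode("windows-1252").decode("utf-8", errors)


intact = "\u00d7\u00a9\u00d7\u0153"  # × © × œ, mojibake for "של"
damaged = "\u00d7\u00d7\u00a9"       # × × ©, "אש" with alef's 0x90 byte missing

fixed_intact = fix_mojibake(intact)    # both letters come back
fixed_damaged = fix_mojibake(damaged)  # U+FFFD marks the lost letter

try:
    fix_mojibake(damaged, errors="strict")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

print(strict_failed)  # True
```

The 'replace' handler is what lets the conversion continue past damaged spots: letters whose second byte was lost become U+FFFD instead of aborting the whole string.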

0

dan04 May 26 '10 at 13:41

Tomer cohen · Accepted Answer · 2010-05-16T09:13:52+0000

Since the problem was a MySQL fault that produced double-encoded UTF8 strings, MySQL is the right tool to solve it.

Running the following commands will resolve the issue:

mysqldump $DB_NAME -u $DB_USER -p -h $DB_HOST.EXAMPLE.NET --add-drop-table --default-character-set=latin1 > export.sql

- latin1 is used here to force MySQL not to strip characters and should not be used otherwise.

cp export{,.utf8}.sql

- Create a backup copy.

sed -i -e 's/latin1/utf8/g' export.utf8.sql

- Replace latin1 with utf8 in the file so that it is imported as UTF-8 instead of ISO 8859-1.

mysql $DB_NAME -u $DB_USER -p -h $DB_HOST.EXAMPLE.NET < export.utf8.sql

- Import everything back into the database.

This will fix the problem in about ten minutes.
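What the dump and re-import do at the byte level can be sketched in Python. This is a simplified model, not a description of MySQL internals: it assumes the stored text was UTF-8 that got treated as latin1 and encoded to UTF-8 a second time, and it uses Python's latin-1 codec as a stand-in for MySQL's latin1 (which is closer to Windows-1252). All variable names are illustrative.

```python
original = "\u05e9\u05dc"  # the Hebrew word "של"

# Double encoding: the UTF-8 bytes are mistaken for latin1 text
# and encoded to UTF-8 a second time.
utf8_once = original.encode("utf-8")
utf8_twice = utf8_once.decode("latin-1").encode("utf-8")

# Dumping with latin1 as the client character set converts the stored
# text to latin1 bytes, which undoes one layer of encoding.
dumped = utf8_twice.decode("utf-8").encode("latin-1")

# Relabelling the dump as utf8 (the sed step) makes the import
# read those same bytes as UTF-8.
restored = dumped.decode("utf-8")

print(len(utf8_once), len(utf8_twice))  # 4 8
print(dumped == utf8_once)              # True
print(restored == original)             # True
```

In this model the dump file already contains correct UTF-8 bytes; the sed step only changes the label that tells MySQL how to interpret them on import.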
