[GTALUG] a solved problem unsolved itself: WordPress, MySQL, UTF-8

Jamon Camisso jamonation at gmail.com
Wed Dec 1 21:53:33 EST 2021


On 01/12/2021 08:05, Stewart C. Russell via talk wrote:
> On 2021-11-29 16:25, Jamon Camisso via talk wrote:
>>
>> Another thing to try is calling mysqli_set_charset($link, "utf8"); (where
>> $link is your mysqli connection) somewhere in your site's code.
>> Substitute in different character sets until you find the correct one ...
> 
> Thanks, Jamon, but there isn't a valid encoding for what my database 
> seems to be holding. It was UTF-8, and now it's seemingly UTF-8 decoded 
> to CP1252 bytes re-encoded to UTF-8 characters again.
> 
> If WordPress were using Python (it's not), if my db held the 4 
> character, 6 byte UTF-8 string, the equivalent Python code to end up in 
> the mess I'm in is:
> 
>      >>> bytes(bytes("côté", encoding='utf-8').decode(encoding='cp1252'), encoding='utf-8')
>      b'c\xc3\x83\xc2\xb4t\xc3\x83\xc2\xa9'
> 
> or 6 characters / 10 bytes of gibberish ('cÃ´tÃ©').
Since that mis-encoding is reversible, can you try undoing it on some of
the corrupted posts/pages? e.g.

 >>> bytes(bytes('cÃ´tÃ©', encoding='utf-8').decode(), encoding='cp1252').decode()
'côté'
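
Spelled out a little more, here is a rough sketch of the round trip in both directions. The function names mangle and unmangle are just mine for illustration, and it assumes the damage really is "UTF-8 bytes read as CP1252", as in your example:

```python
def mangle(text: str) -> str:
    """Reproduce the corruption: UTF-8 bytes misread as CP1252."""
    return text.encode("utf-8").decode("cp1252")


def unmangle(text: str) -> str:
    """Reverse it: get the original bytes back via CP1252, then read them as UTF-8."""
    return text.encode("cp1252").decode("utf-8")


original = "côté"
broken = mangle(original)

print(repr(broken))                  # the 6-character gibberish form
print(unmangle(broken) == original)  # True if the round trip is lossless
```

If unmangle() raises on a real post, that post probably was not damaged in exactly this way.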

> Since this happened in the last month or so, it's not really a legacy 
> encoding issue. Perfectly good UTF-8 got destroyed with no input/changes 
> from me.
> 
> I'd been fairly careful with backups for the first decade of running 
> this blog, but the process got wearing after a while, especially since 
> every update went flawlessly so the manual backup process was a waste of 
> time. WordPress offers automatic updates without forcing a backup 
> checkpoint, which I think is wrong.

Is it a managed WordPress install? That sounds terrible if it is. Worse,
I suppose, if WordPress itself just did it.

Do any of the casting suggestions at the link I sent fix it? Or are you
going to have to dump each row and run it through that double-decoding
process?
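
If it does come to dumping rows, something along these lines might do as a starting point. It is only a sketch: repair() and the sample rows are made up for illustration, and how you get rows out of and back into the database is left out entirely. The idea is to leave a row alone whenever the reverse conversion fails, so text that was never damaged does not get mangled by the fix:

```python
def repair(text: str) -> str:
    """Undo one round of UTF-8-read-as-CP1252 damage, if the text has it.

    Text that cannot be reversed cleanly is returned unchanged.
    """
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not representable in CP1252, or the bytes are not valid UTF-8:
        # treat the row as undamaged (or damaged differently) and skip it.
        return text


# Hypothetical (id, body) pairs standing in for dumped rows.
rows = [
    (1, "cÃ´tÃ©"),        # damaged
    (2, "plain ascii"),   # unaffected either way
    (3, "côté"),          # already correct, must not be touched
]

repaired = [(row_id, repair(body)) for row_id, body in rows]
for row_id, body in repaired:
    print(row_id, body)
```

One caveat: Python's cp1252 codec leaves a few byte values (0x81, 0x8D, 0x8F, 0x90, 0x9D) undefined, so damaged text that involves those will be skipped here and would need a closer look by hand.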

Jamon

