Encoding seems to be lost after exporting to PNG and extracting back

0 votes
asked Oct 12 in Bug by Alexander
It seems encoding information is lost during export to PNG.

Steps:

1. Create a sequence diagram with some text in Russian

2. Export to PNG

3. Now extract it back: java -Dfile.encoding=UTF-8 -jar plantuml.jar -metadata -charset UTF-8 diagram.png

Expected result: all text is readable

Actual result: text in Russian is corrupted, displayed as ?????? ?? ????
commented Oct 12 by albert (2,690 points)
Can you give an example input file to reproduce the problem?
Which version of plantuml.jar are you using?
commented Oct 13 by Alexander
Example: https://yadi.sk/i/yvF7aWVKilyP0g
It was created using this command: java -jar plantuml.jar -charset UTF-8 -tpng charset.plantuml
Version:
λ java -jar plantuml.jar -version
PlantUML version 1.2018.11 (Sat Sep 22 19:43:53 MSK 2018)
(GPL source distribution)
Java Runtime: Java(TM) SE Runtime Environment
JVM: Java HotSpot(TM) 64-Bit Server VM
Java Version: 1.8.0_181-b13
Operating System: Windows 7
OS Version: 6.1
Default Encoding: Cp1252
Language: en
Country: US
Machine: 700634-PC
PLANTUML_LIMIT_SIZE: 4096
Processors: 4
Max Memory: 1,873,805,312
Total Memory: 126,877,696
Free Memory: 122,169,096
Used Memory: 4,708,600
Thread Active Count: 1
commented Oct 13 by albert (2,690 points)
Please supply the source code, or copy and paste the source code into http://www.plantuml.com/plantuml/uml and post the resulting URL.
commented Oct 13 by albert (2,690 points)
It does indeed look like there is a mismatch between the original source and the information stored in the PNG file: the text extracted back shows question marks where the Russian text should be.

2 Answers

0 votes
answered Oct 15 by plantuml (189,260 points)
 
Best answer
In the end, it was easy to use an iTXt chunk.

So this should be solved in the latest beta: http://beta.plantuml.net/plantuml.jar

Tell us if it's not working for you!
commented Oct 15 by Alexander
Actually, the result is still the same.
Maybe I need to use some specific command line options?
commented Oct 15 by plantuml (189,260 points)
Maybe I should be more specific.
You have to re-encode (that is, create a new PNG file) with the latest beta version.
And then extract the metadata back from this new PNG file.
PNG files that were generated with older versions of PlantUML cannot be retrieved (sorry about that).
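One way to check whether a given PNG was written before or after this change is to look at which text chunk types it contains. The following is a minimal Python sketch, not part of PlantUML; the function name `text_chunk_types` is made up for illustration, and it assumes (as described in this thread) that older versions stored the source in a zTXt chunk and the beta stores it in an iTXt chunk.

```python
import struct
from typing import List

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def text_chunk_types(png_bytes: bytes) -> List[str]:
    """Return the types of all textual chunks (tEXt, zTXt, iTXt) in file order."""
    if not png_bytes.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG file")
    found = []
    pos = len(PNG_SIGNATURE)
    # Each chunk is: 4-byte big-endian length, 4-byte type, data, 4-byte CRC.
    while pos + 8 <= len(png_bytes):
        length, chunk_type = struct.unpack(">I4s", png_bytes[pos:pos + 8])
        if chunk_type in (b"tEXt", b"zTXt", b"iTXt"):
            found.append(chunk_type.decode("ascii"))
        pos += 12 + length  # skip header, data and CRC (CRC is not verified here)
    return found
```

Calling `text_chunk_types(open("diagram.png", "rb").read())` on a file that only reports `zTXt` would suggest it needs to be regenerated with the newer version.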
commented Oct 15 by Alexander
Yes, that's clear, and that is exactly what I tried to do (I repeated the steps from my initial description).
But as a result I still see corrupted characters.
Could you please provide the recommended settings that work in your environment?
commented Oct 15 by plantuml (189,260 points)
Could you send us your PNG file by email?

BTW, I see that you have:
Default Encoding: Cp1252
This means that your default console cannot display Russian characters (well, I think :-)

You have to use the following command line:

java -Dfile.encoding=UTF-8 -jar plantuml.jar -metadata -charset UTF-8 diagram.png > back_to_text.txt

Then open the "back_to_text.txt" file in a UTF-8-capable editor.
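To illustrate why the encoding used to read the file matters, here is a small Python sketch (not PlantUML code). It assumes the redirected output is UTF-8, as the command line above requests; the helper names `write_utf8` and `read_text` and the sample diagram text are made up for this example.

```python
import os
import tempfile


def write_utf8(path: str, text: str) -> None:
    """Write text as UTF-8, standing in for the redirected output file."""
    with open(path, "w", encoding="utf-8", newline="") as handle:
        handle.write(text)


def read_text(path: str, encoding: str) -> str:
    """Read a text file, replacing bytes that are invalid in the given encoding."""
    with open(path, "r", encoding=encoding, errors="replace", newline="") as handle:
        return handle.read()


if __name__ == "__main__":
    source = "@startuml\nАлиса -> Боб: привет\n@enduml\n"
    path = os.path.join(tempfile.mkdtemp(), "back_to_text.txt")
    write_utf8(path, source)
    print(read_text(path, "utf-8") == source)   # read as UTF-8: text survives
    print(read_text(path, "cp1252") == source)  # read as Cp1252: text is garbled
```

The bytes in the file are the same in both cases; only the interpretation changes, which is why a tool or console defaulting to Cp1252 shows garbage even when the file is correct.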
commented Oct 16 by Alexander
Now it works; I probably made a mistake during my initial test.
Thanks a lot, this software is absolutely brilliant!
0 votes
answered Oct 14 by plantuml (189,260 points)
Thanks for the report.

We are using the standard zTXt chunk to store the PlantUML source (see http://dev.exiv2.org/projects/exiv2/wiki/The_Metadata_in_PNG_files).

Sadly, zTXt must be encoded using ISO/IEC 8859-1, which means that Russian cannot be used there :-(
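The limitation can be reproduced outside PlantUML. The Python sketch below is only an illustration (the helper names `fits_latin1` and `to_latin1_lossy` are made up): it shows that Cyrillic characters have no ISO/IEC 8859-1 representation, and that forcing the conversion with a replacement character yields question marks similar to those in the report.

```python
def fits_latin1(text: str) -> bool:
    """Return True if every character can be stored in ISO/IEC 8859-1."""
    try:
        text.encode("iso-8859-1")
        return True
    except UnicodeEncodeError:
        return False


def to_latin1_lossy(text: str) -> str:
    """Force text into ISO/IEC 8859-1, replacing unsupported characters with '?'."""
    return text.encode("iso-8859-1", errors="replace").decode("iso-8859-1")


if __name__ == "__main__":
    print(fits_latin1("Alice -> Bob: hello"))       # ASCII text fits
    print(fits_latin1("Алиса -> Боб: привет"))      # Cyrillic text does not
    print(to_latin1_lossy("Алиса -> Боб: привет"))  # Cyrillic letters become '?'
```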

We could use an iTXt chunk, which can be compressed, but the use of those chunks is not well documented (at least in Java), so we have not succeeded (yet) in compressing them.
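For reference, the iTXt layout in the PNG specification is: keyword, null separator, compression flag, compression method, language tag, null separator, translated keyword, null separator, then the UTF-8 text (zlib-compressed when the flag is 1). The Python sketch below builds and parses such a chunk by hand. It is an illustration of the chunk format only, not PlantUML's Java implementation, and the keyword `plantuml` used in the usage note is an assumption.

```python
import struct
import zlib
from typing import Tuple


def build_itxt_chunk(keyword: str, text: str, compress: bool = True) -> bytes:
    """Serialize one iTXt chunk: length, type, data, CRC."""
    payload = text.encode("utf-8")
    if compress:
        payload = zlib.compress(payload)
    data = (
        keyword.encode("iso-8859-1") + b"\x00"  # keyword, null-terminated
        + (b"\x01" if compress else b"\x00")     # compression flag
        + b"\x00"                                # compression method 0 (zlib deflate)
        + b"\x00"                                # empty language tag, null-terminated
        + b"\x00"                                # empty translated keyword, null-terminated
        + payload
    )
    crc = zlib.crc32(b"iTXt" + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + b"iTXt" + data + struct.pack(">I", crc)


def parse_itxt_chunk(chunk: bytes) -> Tuple[str, str]:
    """Return (keyword, text) from a single serialized iTXt chunk."""
    (length,) = struct.unpack(">I", chunk[:4])
    if chunk[4:8] != b"iTXt":
        raise ValueError("not an iTXt chunk")
    data = chunk[8:8 + length]
    (crc,) = struct.unpack(">I", chunk[8 + length:12 + length])
    if crc != zlib.crc32(b"iTXt" + data) & 0xFFFFFFFF:
        raise ValueError("CRC mismatch")
    keyword, rest = data.split(b"\x00", 1)
    compressed = rest[0] == 1
    # rest[1] is the compression method; the two null-terminated fields follow it.
    _language, rest = rest[2:].split(b"\x00", 1)
    _translated, payload = rest.split(b"\x00", 1)
    if compressed:
        payload = zlib.decompress(payload)
    return keyword.decode("iso-8859-1"), payload.decode("utf-8")
```

For example, `parse_itxt_chunk(build_itxt_chunk("plantuml", "Алиса -> Боб: привет"))` returns the keyword and the original Russian text unchanged, because the text field is UTF-8 rather than ISO/IEC 8859-1.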

Another option would be to encode the PlantUML source using UTF-7 (https://en.wikipedia.org/wiki/UTF-7) when we detect that some non-ISO-8859-1 characters are used. Then we could store the UTF-7 encoded string in a zTXt chunk.
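A rough Python sketch of this fallback idea follows; it is not what PlantUML does, and the function names `encode_source` and `decode_source` are hypothetical. How the "UTF-7 was used" flag would be recorded inside the PNG is left open here, so the sketch simply returns it alongside the bytes.

```python
from typing import Tuple


def encode_source(source: str) -> Tuple[bool, bytes]:
    """Return (used_utf7, bytes) where the bytes are always valid ISO/IEC 8859-1."""
    try:
        # Plain Latin-1 text can be stored as-is.
        return False, source.encode("iso-8859-1")
    except UnicodeEncodeError:
        # UTF-7 output is 7-bit ASCII, which is a subset of ISO/IEC 8859-1.
        return True, source.encode("utf-7")


def decode_source(used_utf7: bool, raw: bytes) -> str:
    """Reverse encode_source using the flag it returned."""
    return raw.decode("utf-7" if used_utf7 else "iso-8859-1")


if __name__ == "__main__":
    flag, raw = encode_source("Алиса -> Боб: привет")
    print(flag, raw)
    print(decode_source(flag, raw))
```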

So stay tuned, we'll post a message here when we are ready for testing.

Regards,
...