How do I find the byte encoding of a TIBCO Rendezvous message?
In my Java application, I archive TIBCO RV messages to a file as bytes.
I am writing a small utility application that will replay the messages. This way I can just create a TibrvMsg object from the bytes without having to parse the file and construct the object manually.
The problem is that I am reading a file that was created on a Linux box and trying to run my application on a Windows machine. I am getting an error because of the different encoding the file was written in.
So now I want to record each message in a specific encoding (UTF-8), so that it does not matter which platform I replay on. The application should simply read the file, knowing beforehand which encoding it was written in. I am planning to use the java.nio packages to convert bytes from one encoding to another.
Do I need to know which encoding the bytes of the TIBRV message are in to make the conversion? If so, how can I find out?
You are taking opaque data and appear to be trying to write it to a file as textual data without escaping its non-textual parts (alternatively, you are writing it as raw bytes and then trying to read it back as if it were character based, which is the same problem). This is wrong from the start.
Opaque data should be treated as meaningless and simply persisted unchanged, so that it can later be handed back to an API that knows how to deal with it. If the data must be stored in text form, you have to convert the bytes to text losslessly. Suitable encodings are things like base64. An encoding in the sense of a character set encoding is NOT lossless if you apply it to raw binary data.
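To make the difference concrete, here is a minimal sketch (the class and method names are made up for illustration) that pushes the same arbitrary bytes through a character set round trip and through a base64 round trip. The sample bytes are not a real RV message, just values that are not valid UTF-8.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;

final class LosslessTextDemo {
    // Treat the bytes as UTF-8 text and encode them again.
    // Invalid sequences are replaced with U+FFFD, so data is lost.
    static byte[] viaCharset(byte[] raw) {
        String text = new String(raw, StandardCharsets.UTF_8);
        return text.getBytes(StandardCharsets.UTF_8);
    }

    // Base64 maps every byte sequence to ASCII text and back without loss.
    static byte[] viaBase64(byte[] raw) {
        String text = Base64.getEncoder().encodeToString(raw);
        return Base64.getDecoder().decode(text);
    }

    public static void main(String[] args) {
        // Arbitrary binary sample, standing in for an opaque message.
        byte[] raw = {(byte) 0x00, (byte) 0xFF, (byte) 0xFE, (byte) 0x80, 0x41};
        System.out.println("charset round trip intact: " + Arrays.equals(raw, viaCharset(raw)));
        System.out.println("base64 round trip intact:  " + Arrays.equals(raw, viaBase64(raw)));
    }
}
```

The charset round trip reports that the bytes changed, while the base64 round trip returns them unchanged.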
Simply storing the bytes in a file as bytes (not characters), along with a fixed-length prefix indicating the length of the message and the subject to which it was sent, is sufficient to replay RV messages through the system.
Regarding any text fields inside the message: if there is an encoding problem (I highly recommend avoiding this altogether when designing your application), you have the same problem on replay as you had at the original receive time, namely converting from the original encoding to the desired one (hopefully using exactly the same code), so this should be a non-issue on replay.
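As a rough sketch of such a record format, assuming a layout I made up for illustration (a 4-byte length followed by the subject as UTF-8, then a 4-byte length followed by the untouched message bytes), something like this would do. The class, the method names and the subject string are hypothetical, and the body here is just sample bytes rather than a real message.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class RvArchiveSketch {
    // Assumed layout: [int subjectLen][subject, UTF-8][int bodyLen][body bytes].
    // ByteBuffer is big-endian by default on every platform.
    static byte[] encodeRecord(String subject, byte[] body) {
        byte[] subjectBytes = subject.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.allocate(4 + subjectBytes.length + 4 + body.length);
        buf.putInt(subjectBytes.length).put(subjectBytes);
        buf.putInt(body.length).put(body);
        return buf.array();
    }

    // Read one length-prefixed chunk from the current buffer position.
    static byte[] readChunk(ByteBuffer in) {
        byte[] chunk = new byte[in.getInt()];
        in.get(chunk);
        return chunk;
    }

    public static void main(String[] args) {
        byte[] body = {(byte) 0x99, 0x55, (byte) 0xEE, (byte) 0xAA, 0x00, 0x01};
        byte[] record = encodeRecord("DEMO.SUBJECT", body);

        ByteBuffer in = ByteBuffer.wrap(record);
        String subject = new String(readChunk(in), StandardCharsets.UTF_8);
        byte[] replayed = readChunk(in); // hand these bytes back to the messaging API
        System.out.println(subject + " intact: " + Arrays.equals(body, replayed));
    }
}
```

Only the subject is ever treated as text, and with an explicitly named charset. The message body is never decoded, so it reads back identically on Linux and Windows.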
As this (admittedly quite old) mailing list post points out, little is known about the internal structure of this networking protocol. This can make what you are looking for quite a challenge.
However, if the messages are just binary blocks of data (received from the network), they do not need to have an encoding at all. Charsets matter for text data, because a single character can be encoded in many different ways. Binary data does not consist of characters, so it has no encoding in this sense.
This is probably due to the Java string encoding, not TIBRV. Although there is this in the documentation:
Strings and Character Encodings
--------------------------------------------------------------------------------
Rendezvous software uses strings in several roles:
* String data inside message fields
* Field names
* Subject names (and other associated strings that are not strictly inside the message)
* Certified delivery correspondent names
* Group names (fault tolerance)

All these strings (both in C and in wire format) use the character encoding appropriate to the ISO locale of the sender. For example, the United States is locale en_US, and uses the Latin-1 character encoding (also called ISO 8859-1); Japan is locale ja_JP, and uses the Shift-JIS character encoding.

When two programs exchange messages within the same locale, strings are always correct. However, when a message sender and receiver use different character encodings, the receiving program must convert between encodings as needed. Rendezvous software does not convert automatically.

EBCDIC
For information about string encoding in EBCDIC environments, see tibrv_SetCodePages().
So you can look at the locale of the machines.
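A small sketch of what a mismatch looks like in plain Java, with no RV classes involved: the bytes of a Latin-1 string, standing in for a string field as sent from an en_US machine, are decoded once with the sender's charset and once with UTF-8. The class and method names are made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class CharsetMismatchDemo {
    // Decode wire bytes with an explicitly chosen charset.
    static String decode(byte[] wire, Charset charset) {
        return new String(wire, charset);
    }

    public static void main(String[] args) {
        // "café" as a Latin-1 sender would put it on the wire: 63 61 66 E9.
        byte[] wire = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);

        System.out.println("platform default: " + Charset.defaultCharset());
        System.out.println("as ISO-8859-1:    " + decode(wire, StandardCharsets.ISO_8859_1));
        System.out.println("as UTF-8:         " + decode(wire, StandardCharsets.UTF_8));
    }
}
```

Decoding with the sender's charset recovers the text, while decoding as UTF-8 turns the last character into the replacement character U+FFFD. Naming the charset explicitly avoids depending on whatever the platform default happens to be.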
Do I need to know what encoding the TIBRV message bytes are encoded in to make the conversion?
Yes. An encoding is a method of converting text to a stream of bytes and vice versa. Your network data is a stream of bytes, so when you interpret parts of it as text, you ARE (implicitly or explicitly) using an encoding. The question is whether it is the correct one.
Converting bytes from one encoding to another basically means converting them to text using one encoding and then back to bytes using another. Note that this can change the length of the data, since many character sets use more than one byte for many characters. In the context of network messages, this can be problematic when it invalidates length fields or causes text fields to overflow. It is probably best not to do any conversions and instead teach the reading application to deal with the various encodings.
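For illustration, a minimal sketch of such a conversion on a standalone text value (the class and method names are made up). It shows the length change that would break any length field stored alongside the text.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class TranscodeDemo {
    // Bytes -> text using the source charset, then text -> bytes using the target.
    static byte[] transcode(byte[] input, Charset from, Charset to) {
        return new String(input, from).getBytes(to);
    }

    public static void main(String[] args) {
        byte[] latin1 = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);
        byte[] utf8 = transcode(latin1, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8);
        // The accented character takes one byte in Latin-1 but two in UTF-8.
        System.out.println(latin1.length + " bytes -> " + utf8.length + " bytes");
    }
}
```

This is only safe on bytes that are known to be text in the source charset. Applied to a whole binary message, it would rewrite bytes that were never characters in the first place.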
If so, how can I find out?
Take a look at the protocol specification.