Endianness and the C API: OpenSSL in particular

I have an algorithm that uses the following OpenSSL calls:

HMAC_Update() / HMAC_Final() // ripe160
EVP_CipherUpdate() / EVP_CipherFinal() // cbc_blowfish

      

This algorithm takes an unsigned char * in "plain text". My input comes from a C++ std::string::c_str(), which comes from a protocol buffer object as a UTF-8 encoded string. UTF-8 strings are meant to be endian-neutral. However, I'm a little paranoid about how OpenSSL may perform operations on the data.

My understanding is that encryption algorithms work on 8-bit blocks of data, and if an unsigned char * is used for pointer arithmetic when performing operations, the algorithms should be endian-neutral and I don't need to worry about anything. My uncertainty is compounded by the fact that I am working on a little-endian machine and have never done any real cross-architecture programming.

My belief/reasoning is based on the following two properties:

  • std::string (not wstring) internally uses an 8-bit ptr, and the resulting c_str() ptr will iterate the same way regardless of the CPU architecture.
  • Encryption algorithms are endian-neutral, either by design or by implementation.

I know the best way to get a definitive answer is to use QEMU and run some cross platform unit tests (which I plan on doing). My question is a request for comments on my reasoning and may help other programmers facing similar problems.

+2




4 answers


Some cryptographic algorithms, in particular hash functions (which are used in HMAC), are specified to operate on arbitrary sequences of bits. However, in actual physical computers and most protocols, data is a sequence of octets: the number of bits is a multiple of eight, and the bits can be processed in groups of eight. A group of eight bits is nominally an "octet", but the term "byte" is more common. An octet has a numeric value between 0 and 255, inclusive. In some programming languages (like Java) the numeric value is signed (between -128 and +127), but the concept is the same.

It should be noted that in the context of the C programming language (as defined in ISO 9899:1999, the "C standard"), a byte is defined as the elementary addressable unit of memory, embodied in the unsigned char type. sizeof returns a size in bytes (thus sizeof(unsigned char) is necessarily equal to 1). malloc() takes a size in bytes. In C, the number of bits in a byte is given by the macro CHAR_BIT (defined in <limits.h>) and is greater than or equal to eight. On most computers, a C byte has exactly eight bits (i.e., a C byte is an octet, and everyone calls it a "byte"). There are a few systems with larger bytes (often embedded DSPs), but if you had such a system you would know it.
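If you want the compiler to enforce the "a C byte is an octet" assumption rather than just trusting it, a minimal sketch could look like the following. The helper name bits_in_bytes is made up for illustration, and _Static_assert assumes a C11 compiler.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

/* Refuse to compile on platforms where a C byte is not an octet. */
_Static_assert(CHAR_BIT == 8, "this code assumes 8-bit bytes");

/* Hypothetical helper: number of bits occupied by an object of n bytes. */
size_t bits_in_bytes(size_t n)
{
    assert(n <= SIZE_MAX / CHAR_BIT);   /* guard against overflow */
    return n * CHAR_BIT;
}
```

On an ordinary PC the static assertion is a no-op; on an exotic DSP it turns a silent misbehavior into a build error.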

Thus, every cryptographic algorithm that operates on arbitrary bit sequences actually defines how the bits are to be interpreted as octets (bytes) internally. The AES and SHA specifications go to great lengths to do this properly, even in the eyes of a picky mathematician. In every practical situation, your data is already a sequence of bytes, and the grouping of bits into bytes is assumed to have already taken place; so you just feed the bytes to the algorithm implementation and everything is fine.

Hence, from a practical point of view, cryptographic algorithm implementations expect a sequence of bytes as input and produce sequences of bytes as output.
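To illustrate what "a sequence of bytes in" means in code, here is a rough sketch using FNV-1a, a simple NON-cryptographic hash that I am using purely as a stand-in for HMAC or a cipher (it is not what OpenSSL does internally). The point is only the shape of the loop: it consumes one unsigned char at a time, in address order, so the result cannot depend on the host's endianness.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* FNV-1a, 32-bit. NOT a cryptographic hash; a stand-in for illustration. */
uint32_t fnv1a32(const unsigned char *data, size_t len)
{
    uint32_t h = 2166136261u;           /* FNV offset basis */

    assert(data != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        h ^= data[i];                   /* consume exactly one byte */
        h *= 16777619u;                 /* FNV prime; wraps modulo 2^32 */
    }
    return h;
}
```

Calling it as fnv1a32((const unsigned char *)s.c_str(), s.size()) from C++ would give the same value on a little-endian and a big-endian host, because no multibyte integer is ever read from the input.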

Endianness (implicitly at the byte level) is a convention for how multibyte values (values that need several bytes to be encoded) are laid out as a sequence of bytes (that is, which byte comes first). UTF-8 is endian-neutral because it already defines this layout: when a character has to be encoded into multiple bytes, UTF-8 specifies which of those bytes comes first and which comes last. Converting characters to bytes is thus a fixed convention that does not depend on how the local hardware best reads or writes bytes. Endianness is most often associated with how integer values are written to memory.
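As a sketch of why UTF-8 has no byte-order problem, the encoder below builds each output byte from shifts and masks on the code point value, so the order of the bytes is fixed by the encoding itself. The function name utf8_encode is my own; this is an illustration, not the routine used by protocol buffers or any particular library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode one Unicode code point as UTF-8.
 * Returns the number of bytes written (1..4), or 0 if cp is not encodable
 * (a surrogate, or above U+10FFFF). */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    assert(out != NULL);
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                   /* surrogates are not encodable */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

For example, U+20AC (the euro sign) always comes out as the three bytes E2 82 AC, in that order, whatever the CPU.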

About cross-platform programming: there is no substitute for actually trying it, so testing on multiple platforms is a good way to go. You will already learn a lot by making your code 64-bit clean, that is, having the same code work correctly on both 32-bit and 64-bit platforms. Any recent Linux PC will fit the bill. Big-endian systems are quite rare nowadays; you would need an older Mac (one with a PowerPC processor) or one of a few kinds of Unix workstations (Sparc systems, or Itanium systems under HP/UX). Newer designs tend to adopt the little-endian convention.



About endianness in C: if your program has to worry about endianness, then chances are you are doing it wrong. Endianness is about the conversion of integers (16 bits, 32 bits or more) to bytes and back. If your code worries about endianness, it means that your code writes data as integers and reads it as bytes, or vice versa. Either way, you are doing "type aliasing": some parts of memory are accessed through several pointers of different types. This is bad. Not only does it make your code less portable, it also tends to break when the compiler is asked to optimize your code.
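To make the problem concrete, here is a small sketch (function names are mine) of what happens when raw bytes are dropped straight into an integer. It uses memcpy rather than a pointer cast, so it avoids the aliasing trouble, but the value it returns still depends on the host.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Returns 1 on a little-endian host, 0 otherwise. */
int host_is_little_endian(void)
{
    const uint32_t probe = 1;
    unsigned char first;

    memcpy(&first, &probe, 1);          /* look at the lowest-addressed byte */
    return first == 1;
}

/* Host-dependent load: copies four bytes into the integer's storage.
 * Well-defined C, but the result differs between architectures. */
uint32_t load32_host(const unsigned char *p)
{
    uint32_t v;

    assert(p != NULL);
    memcpy(&v, p, sizeof v);
    return v;
}
```

Given the bytes 78 56 34 12, load32_host yields 0x12345678 on a little-endian machine and 0x78563412 on a big-endian one, which is exactly the kind of surprise you want to keep out of your code.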

In a correct C program, endianness is handled only for I/O, when values are to be written to or read from a file or a network socket. That I/O follows a protocol which defines the endianness to use (for example, in TCP/IP, the big-endian convention is widely used). The "correct" way is to write a few wrapper functions:

#include <stdint.h>

/* Read a 32-bit value stored least significant byte first. */
uint32_t decode32le(const void *src)
{
    const unsigned char *buf = src;
    return (uint32_t)buf[0] | ((uint32_t)buf[1] << 8)
        | ((uint32_t)buf[2] << 16) | ((uint32_t)buf[3] << 24);
}

/* Read a 32-bit value stored most significant byte first. */
uint32_t decode32be(const void *src)
{
    const unsigned char *buf = src;
    return (uint32_t)buf[3] | ((uint32_t)buf[2] << 8)
        | ((uint32_t)buf[1] << 16) | ((uint32_t)buf[0] << 24);
}

/* Write val least significant byte first. */
void encode32le(void *dst, uint32_t val)
{
    unsigned char *buf = dst;
    buf[0] = val;
    buf[1] = val >> 8;
    buf[2] = val >> 16;
    buf[3] = val >> 24;
}

/* Write val most significant byte first. */
void encode32be(void *dst, uint32_t val)
{
    unsigned char *buf = dst;
    buf[3] = val;
    buf[2] = val >> 8;
    buf[1] = val >> 16;
    buf[0] = val >> 24;
}

      

Perhaps make these functions "static inline" and put them in a header file so that the compiler can inline them into the calling code.

You then use these functions whenever you want to write or read 32-bit integers to or from a memory buffer that was recently read from (or will soon be written to) a file or socket. This makes your code completely endian-neutral (hence portable) and clear, which makes it easier to read, develop, debug, and maintain. And in the extremely rare situation where such encoding and decoding becomes a bottleneck (this can only happen on a platform with a very weak CPU and a very fast network connection, i.e. not on a PC at all), you can still replace the implementation of these functions with architecture-specific macros, possibly with inline assembly, without changing the rest of your code.
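As a usage sketch, suppose a made-up wire format where each message is a 4-byte big-endian length followed by the payload. The frame layout and the names frame_write, frame_read, put32be and get32be are assumptions for illustration; the two helpers repeat the encode32be/decode32be technique so that the block compiles on its own.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Same technique as encode32be/decode32be, repeated for self-containment. */
static void put32be(unsigned char *buf, uint32_t val)
{
    buf[0] = (unsigned char)(val >> 24);
    buf[1] = (unsigned char)(val >> 16);
    buf[2] = (unsigned char)(val >> 8);
    buf[3] = (unsigned char)val;
}

static uint32_t get32be(const unsigned char *buf)
{
    return ((uint32_t)buf[0] << 24) | ((uint32_t)buf[1] << 16)
        | ((uint32_t)buf[2] << 8) | (uint32_t)buf[3];
}

/* Write a frame into out. Returns bytes written, or 0 if out is too small. */
size_t frame_write(unsigned char *out, size_t cap,
                   const unsigned char *payload, uint32_t len)
{
    assert(out != NULL && (payload != NULL || len == 0));
    if (cap < 4 || cap - 4 < len)
        return 0;
    put32be(out, len);
    if (len > 0)
        memcpy(out + 4, payload, len);
    return 4 + (size_t)len;
}

/* Parse a frame of n bytes. Returns 1 and sets *payload and *len on success,
 * or 0 if the buffer is shorter than the header announces. */
int frame_read(const unsigned char *in, size_t n,
               const unsigned char **payload, uint32_t *len)
{
    uint32_t l;

    assert(in != NULL && payload != NULL && len != NULL);
    if (n < 4)
        return 0;
    l = get32be(in);
    if (n - 4 < l)
        return 0;
    *payload = in + 4;
    *len = l;
    return 1;
}
```

Nothing outside the two small helpers knows or cares about byte order, which is the whole point of the wrapper approach.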

+3




A UTF-8 string and a std::string are both defined as a sequence of chars. Crypto algorithms are defined to work on a sequence of bytes/octets (in C, bytes are the same thing as chars, and if your byte is not an octet then you are on an unusual implementation and may have to be a little careful with UTF-8). The only sensible way to represent a sequence of bytes in contiguous memory is with the first at the lowest address and subsequent ones at higher addresses (a C array). Crypto algorithms don't care what the bytes represent, so you're fine.

Endianness only matters when you are dealing with something like int, which is not inherently a sequence of bytes. In the abstract, it is just "something" that holds the values INT_MIN to INT_MAX. When you come to represent such a beast in memory, of course it has to be as a number of bytes, but there is no single way to do it.



In practice, endianness matters in C if you (perhaps through something you call) reinterpret a char * as an int * or vice versa, or define a protocol in which an int is represented using a sequence of chars. If you are only dealing with arrays of chars, or only dealing with arrays of ints, it doesn't matter, because endianness is a property of ints and other types bigger than char.

+7




The real question here seems to be:

"How can I be sure that my UTF-8 encoded string will be internally represented the same way on different computers?"

Because, as you said, OpenSSL routines don't really care about this (nor should they have to know).

Since you are only asking for comments, I think you should be fine. OpenSSL routines should behave the same for two identical pieces of data, regardless of the architecture of the computer.

+2




One way to be sure is to follow the IP standard of network byte order.

Take a look here for the functions you need. They should be available on Windows and *nix with modern C++ implementations.

However, I believe your reasoning is correct and you shouldn't worry about it in this case.

Edit: To be clear, the network byte order comment assumes that you are sending the data and are worried about how it will be received on the other end. If sending and receiving both happen on the same machine, there should be no problem.
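The link to the functions did not survive, so as an assumption on my part: the usual candidates are the POSIX htonl()/ntohl() family. A minimal sketch of using them to put a 32-bit value on the wire might look like this (store32_network and load32_network are hypothetical names; on Windows the same functions come from <winsock2.h> instead of <arpa/inet.h>).

```c
#include <arpa/inet.h>   /* htonl, ntohl (POSIX) */
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Store v into out in network (big-endian) byte order. */
void store32_network(unsigned char out[4], uint32_t v)
{
    uint32_t n = htonl(v);              /* host order -> network order */

    assert(out != NULL);
    memcpy(out, &n, sizeof n);
}

/* Load a value that was stored in network byte order. */
uint32_t load32_network(const unsigned char in[4])
{
    uint32_t n;

    assert(in != NULL);
    memcpy(&n, in, sizeof n);
    return ntohl(n);                    /* network order -> host order */
}
```

Storing 0x01020304 this way produces the bytes 01 02 03 04 on any host, so both ends agree on the layout regardless of their own architecture.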

0

