I can imagine that fewer bits would be ineffective for handling enough data to make computing feasible, while too many would have led to expensive hardware. Were other influences in play? Why did these forces balance out to eight bits?
(BTW, if I could time travel, I'd go back to when the "byte" was declared to be 8 bits, and convince everyone to make it 12 bits, bribing them with some early 21st Century trinkets.)
http://softwareengineering.stackexchange.com/questions/120126/what-is-the-history-of-why-bytes-are-eight-bits
Historically, bytes haven't always been 8-bit in size (for that matter, computers don't have to be binary either, but non-binary computing has seen much less action in practice). It is for this reason that IETF and ISO standards often use the term octet - they don't use byte because they don't want it assumed to mean 8 bits when it might not.
Indeed, when byte was coined it was defined as a 1- to 6-bit unit. Byte sizes in use throughout history include 7, 9 and 36 bits, and some machines had variable-sized bytes.
8 was a mixture of commercial success, it being a convenient enough number for the people thinking about it (the two would have fed into each other), and no doubt other reasons I'm completely ignorant of.
The ASCII standard you mention assumes a 7-bit byte, and was based on earlier 6-bit communication standards.
Edit: It may be worth adding to this, as some are insisting that those saying bytes are always octets are confusing bytes with words.
An octet is a name given to a unit of 8 bits (from the Latin for eight). If you are using a computer (or, at a higher abstraction level, a programming language) where bytes are 8-bit, then this is easy to do; otherwise you need some conversion code (or conversion in hardware). The concept of octet comes up more in networking standards than in local computing, because in being architecture-neutral it allows for the creation of standards that can be used in communicating between machines with different byte sizes, hence its use in IETF and ISO standards. (Incidentally, ISO/IEC 10646 uses octet where the Unicode Standard uses byte for what is essentially - with some minor extra restrictions on the latter part - the same standard, though the Unicode Standard does detail that by byte it means octet, even though bytes may be different sizes on different machines.) The concept of octet exists precisely because 8-bit bytes are common (hence the choice of using them as the basis of such standards) but not universal (hence the need for another word to avoid ambiguity).
Historically, a byte was the size used to store a character, a matter which in turn builds on practices, standards and de-facto standards that pre-date computers and were used for telex and other communication methods, starting perhaps with Baudot in 1870 (I don't know of any earlier, but am open to corrections).
This is reflected by the fact that in C and C++ the unit for storing a byte is called char, whose size in bits is defined by CHAR_BIT in the standard limits.h header. Different machines would use 5, 6, 7, 8, 9 or more bits to define a character. These days, of course, we define characters as 21-bit and use different encodings to store them in 8-, 16- or 32-bit units (and non-Unicode-authorised ways like UTF-7 for other sizes), but historically that was the way it was.
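For concreteness, here is a minimal C sketch (my own illustration, not part of the original answer) that asks the implementation how wide its bytes are; practically every modern platform reports 8, and the C standard only guarantees that CHAR_BIT is at least 8:

    #include <limits.h>   /* CHAR_BIT */
    #include <stdio.h>

    int main(void)
    {
        /* sizeof(char) is 1 by definition; CHAR_BIT says how many bits that one byte holds. */
        printf("bits per byte (CHAR_BIT): %d\n", CHAR_BIT);
        return 0;
    }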
In languages which aim to be more consistent across machines, rather than reflecting the machine architecture, byte tends to be fixed in the language, and these days this generally means it is defined in the language as 8-bit. Given the point in history when they were made, and that most machines now have 8-bit bytes, the distinction is largely moot, though it's not impossible to implement a compiler, run-time, etc. for such languages on machines with different-sized bytes, just not as easy.
A word is the "natural" size for a given computer. This is less clearly defined, because it affects a few overlapping concerns that would generally coïncide, but might not. Most registers on a machine will be this size, but some might not. The largest address size would typically be a word, though this may not be the case (the Z80 had an 8-bit byte and a 1-byte word, but allowed some doubling of registers to give some 16-bit support including 16-bit addressing).
Again we see here a difference in C and C++, where int is defined in terms of the word size and long is defined to take advantage of a processor which has a "long word" concept, should such exist, though it may be identical to int in a given case. The minimum and maximum values are again in the limits.h header. (Indeed, as time has gone on, int may be defined as smaller than the natural word size, as a combination of consistency with what is common elsewhere, reduction in memory usage for an array of ints, and probably other concerns I don't know of.)
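A small C sketch (again my own addition) makes the point that these widths and limits are implementation-defined rather than fixed by the language; as an assumption about typical targets, a 64-bit Linux build usually prints 4 and 8 bytes, while a 64-bit Windows build prints 4 and 4:

    #include <limits.h>   /* INT_MIN, INT_MAX, LONG_MIN, LONG_MAX */
    #include <stdio.h>

    int main(void)
    {
        /* Both the sizes and the ranges come from the implementation, not the language. */
        printf("int:  %zu bytes, %d .. %d\n", sizeof(int), INT_MIN, INT_MAX);
        printf("long: %zu bytes, %ld .. %ld\n", sizeof(long), LONG_MIN, LONG_MAX);
        return 0;
    }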
Java and .NET languages take the approach of defining int and long as fixed across all architectures, making any differences an issue for the runtime (particularly the JITter) to deal with. Notably though, even in .NET the size of a pointer (in unsafe code) will vary depending on architecture to be the underlying word size, rather than a language-imposed word size.
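The same architecture dependence is easy to observe in C, where pointer width tracks the machine rather than the language (a minimal sketch of my own; in .NET the analogous value is reported by IntPtr.Size):

    #include <stdio.h>

    int main(void)
    {
        /* Typically 4 bytes on a 32-bit target and 8 bytes on a 64-bit target. */
        printf("pointer size: %zu bytes\n", sizeof(void *));
        return 0;
    }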
Hence, octet, byte and word are all quite independent of each other, even though today it is common for octet == byte and for a word to be a whole number of bytes (and a binary-round number such as 2, 4 or 8).
A lot of really early work was done with 5-bit Baudot codes, but those quickly became quite limiting (only 32 possible characters, so basically only upper-case letters and a few punctuation marks, but not enough "space" for digits).
From there, quite a few machines went to 6-bit characters. This was still pretty inadequate though -- if you wanted upper- and lower-case (English) letters and digits, that left only two more characters for punctuation, so most still had only one case of letters in a character set.
ASCII defined a 7-bit character set. That was "good enough" for a lot of uses for a long time, and has formed the basis of most newer character sets as well (ISO 646, ISO 8859, Unicode, ISO 10646, etc.)
Binary computers motivate designers to make sizes powers of two. Since the "standard" character set required 7 bits anyway, it wasn't much of a stretch to add one more bit to get a power of 2 (and by then, storage was becoming cheap enough that "wasting" a bit for most characters was more acceptable as well).
Since then, character sets have moved to 16 and 32 bits, but most mainstream computers are largely based on the original IBM PC (a design that'll be 30 years old within the next few months). Then again, enough of the market is sufficiently satisfied with 8-bit characters that even if the PC hadn't come to its current level of dominance, I'm not sure everybody would do everything with larger characters anyway.
I should also add that the market has changed quite a bit. In the current market, the character size is defined less by the hardware than the software. Windows, Java, etc., moved to 16-bit characters long ago.
Now, the hindrance in supporting 16- or 32-bit characters is only minimally from the difficulties inherent in 16- or 32-bit characters themselves, and largely from the difficulty of supporting i18n in general. In ASCII (for example) detecting whether a letter is upper or lower case, or converting between the two, is incredibly trivial. In full Unicode/ISO 10646, it's basically indescribably complex (to the point that the standards don't even try -- they give tables, not descriptions). Then you add in the fact that for some languages/character sets, even the basic idea of upper/lower case doesn't apply. Then you add in the fact that even displaying characters in some of those is much more complex still.
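To see just how trivial the ASCII case is, here is a minimal C sketch (my own, not from the answer): upper- and lower-case ASCII letters differ only in bit 5 (0x20), so classification is one comparison and conversion is one bit operation; nothing remotely this simple works for full Unicode case mapping.

    #include <stdio.h>

    /* ASCII-only: 'A'..'Z' are 0x41..0x5A and 'a'..'z' are 0x61..0x7A. */
    static int is_ascii_upper(char c)   { return c >= 'A' && c <= 'Z'; }
    static char ascii_tolower(char c)   { return is_ascii_upper(c) ? (char)(c | 0x20) : c; }

    int main(void)
    {
        printf("%c -> %c\n", 'Q', ascii_tolower('Q'));   /* prints "Q -> q" */
        return 0;
    }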
That's all sufficiently complex that the vast majority of software doesn't even try. The situation is slowly improving, but slowly is the operative word.
Take a look at the Wikipedia page on 8-bit architecture.
Although character sets could have been 5-, 6- and then 7-bit, the underlying CPU/memory bus architecture always used powers of 2. The very first microprocessor (in the early 1970s) had a 4-bit bus, which means one instruction could move 4 bits of data between external memory and the CPU.
Then, with the release of the 8080 processor, 8-bit architecture became popular, and that's what gave the beginnings of the x86 instruction set, which is used even to this day. If I had to guess, the byte came from these early processors, when the mainstream public began accepting and playing with PCs and 8 bits was considered the standard size of a single unit of data.
Since then the bus size has kept doubling, but it has always remained a power of 2 (i.e. 16, 32 and now 64 bits). Actually, I'm sure the internals of today's buses are much more complicated than simply 64 parallel wires, but current mainstream CPU architecture is 64-bit.
I would assume that by always doubling (instead of growing by 50%) it was easier to make new hardware that coexists with existing applications and other legacy components. So, for example, when they went from 8 bits to 16, each instruction could now move 2 bytes instead of 1, so you save yourself one clock cycle but the end result is the same. However, if you went from an 8- to a 12-bit architecture, you'd end up breaking the original data into halves, and managing that could become annoying. These are just guesses; I'm not really a hardware expert.