Extended Unix Code (EUC) is a multibyte character encoding system used primarily for Japanese, Korean, and simplified Chinese.
114-507: JIS X 0208 is a 2-byte character set specified as a Japanese Industrial Standard , containing 6879 graphic characters suitable for writing text, place names, personal names, and so forth in the Japanese language . The official title of the current standard is 7-bit and 8-bit double byte coded KANJI sets for information interchange ( 7ビット及び8ビットの2バイト情報交換用符号化漢字集合 , Nana-Bitto Oyobi Hachi-Bitto no Ni-Baito Jōhō Kōkan'yō Fugōka Kanji Shūgō ) . It
228-402: A byte order mark or escape sequences ; compressing schemes try to minimize the number of bytes used per code unit (such as SCSU and BOCU ). Although UTF-32BE and UTF-32LE are simpler CESes, most systems working with Unicode use either UTF-8 , which is backward compatible with fixed-length ASCII and maps Unicode code points to variable-length sequences of octets, or UTF-16BE , which
342-437: A string of the letters "ab̲c𐐀", that is, a string containing a Unicode combining character ( U+0332 ̲ COMBINING LOW LINE ) as well as a supplementary character ( U+10400 𐐀 DESERET CAPITAL LETTER LONG I ). This string has several Unicode representations which are logically equivalent, though each is suited to a different set of circumstances or range of requirements: Note in particular that 𐐀
456-595: A Japanese-language common name ( 日本語通用名称 , Nihongo tsūyō meishō ) , but some provisions for these names do not exist. The names of kanji, on the other hand, are mechanically set according to the corresponding hexadecimal representation of their code in UCS/Unicode. The name of a kanji can be arrived at by prepending the Unicode codepoint with "CJK UNIFIED IDEOGRAPH-". For example, row 16 cell 1 ( 亜 ) corresponds to U+4E9C in UCS, so
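The naming rule for kanji described above can be sketched in Python and checked against the standard `unicodedata` module. The helper name `kanji_name` is hypothetical and used only for illustration.

```python
import unicodedata


def kanji_name(ch: str) -> str:
    """Build the mechanical kanji name from the character's UCS code point."""
    return f"CJK UNIFIED IDEOGRAPH-{ord(ch):04X}"


# Row 16, cell 1 of JIS X 0208 corresponds to U+4E9C.
name = kanji_name("\u4e9c")
print(name)
# The Unicode Character Database names unified ideographs by the same rule.
print(name == unicodedata.name("\u4e9c"))
```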
570-404: A cell number (each numbered from 1 to 94, for a standard JIS X 0208 code) form a kuten ( 区点 ) point, which is used to represent double-byte code points. A code number or kuten number ( 区点番号 , kuten bangō ) is expressed in the form "row-cell", the row and cell numbers being separated by a hyphen . For example, the character " 亜 " has a code point at row 16, cell 1, so its code number
684-405: A character belonging to an ISO/IEC 646 compliant coded character set (such as ASCII ) taking one byte, and a character belonging to a 94×94 coded character set (such as GB 2312 ) represented in two bytes. The EUC-CN form of GB 2312 and EUC-KR are examples of such two-byte EUC codes. EUC-JP includes characters represented by up to three bytes, including an initial shift code , whereas
798-486: A double-byte DBCS-Host mode using shifting sequences (where 0x29 switches to single-byte mode and 0x28 switches to double-byte mode). Also similarly to KEIS, JIS X 0208 codes are represented the same as in EUC-JP. The lead byte range is extended back to 0x41, with 0x80–0xA0 designated for user definition; lead bytes 0x41–0x7F are assigned row numbers 101 through 163 for kuten purposes, although row 162 (lead byte 0x7E)
912-499: A fixed-length transformation format called the EUC complete two-byte format . This represents: Initial bytes of 0x00 and 0x80 are used in cases where the code set uses only one byte. There is also a four-byte fixed-length format. These fixed-length encoding formats are suited to internal processing and are not usually encountered in interchange. EUC-JP is registered with the IANA in both formats,
1026-470: A particular byte in a character string belongs to the ISO 646 code or the extended code. Characters in code sets 2 and 3 are prefixed with the control codes SS2 (0x8E) and SS3 (0x8F) respectively, and invoked over GR. Besides the initial shift code, any byte outside of the range 0xA0–0xFF appearing in a character from code sets 1 through 3 is not a valid EUC code. The EUC code itself does not make use of
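As a rough illustration of the shift-code rule above, the following Python sketch splits packed-format EUC-JP bytes into characters and labels each with its code set. The function name `split_euc_jp` is hypothetical. The sketch assumes the usual EUC-JP lengths (two bytes for code sets 1 and 2 including the SS2 prefix, three bytes for code set 3 including SS3) and, as a simplification, rejects every byte in 0x80–0x9F other than SS2 and SS3.

```python
SS2, SS3 = 0x8E, 0x8F  # single-shift codes for code sets 2 and 3


def split_euc_jp(data: bytes) -> list:
    """Split packed-format EUC-JP bytes into (code_set, raw_bytes) pairs."""
    chars = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:
            code_set, length = 0, 1      # ISO 646 byte, invoked over GL
        elif lead == SS2:
            code_set, length = 2, 2      # SS2 + one GR byte
        elif lead == SS3:
            code_set, length = 3, 3      # SS3 + two GR bytes
        elif lead >= 0xA0:
            code_set, length = 1, 2      # two GR bytes
        else:
            raise ValueError(f"unexpected byte 0x{lead:02X} at offset {i}")
        chunk = data[i:i + length]
        if len(chunk) < length:
            raise ValueError(f"truncated character at offset {i}")
        # Apart from the shift code, every byte must lie in 0xA0-0xFF.
        if any(b < 0xA0 for b in chunk[1:]):
            raise ValueError(f"invalid trail byte at offset {i}")
        chars.append((code_set, chunk))
        i += length
    return chars


print(split_euc_jp(b"a\x8e\xb1\xb0\xa1\x8f\xb0\xa1"))
```

Because the high bit alone separates code set 0 from the rest, the scan never needs to look backwards to decide how a byte should be read.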
1140-420: A particular sequence of bits. Instead, characters would first be mapped to a universal intermediate representation in the form of abstract numbers called code points . Code points would then be represented in a variety of ways and with various default numbers of bits per character (code units) depending on context. To encode code points higher than the length of the code unit, such as above 256 for eight-bit units,
1254-404: A process known as transcoding . Some of these are cited below. Cross-platform : Windows : The most used character encoding on the web is UTF-8 , used in 98.2% of surveyed web sites, as of May 2024. In application programs and operating system tasks, both UTF-8 and UTF-16 are popular options. EUC-JP The most commonly used EUC codes are variable-length encodings with
1368-461: A sequence of the 94 7-bit bytes 0x21–7E, or alternatively 0xA1–FE if an eighth bit is available. This allows for sets of 94 graphical characters, or 8836 (94²) characters, or 830584 (94³) characters. Although initially 0x20 and 0x7F were always the space and delete character and 0xA0 and 0xFF were unused, later editions of ISO/IEC 2022 allowed the use of the bytes 0xA0 and 0xFF (or 0x20 and 0x7F) within sets under certain circumstances, allowing
1482-461: A single glyph . The former simplifies the text handling system, but the latter allows any letter/diacritic combination to be used in text. Ligatures pose similar problems. Exactly how to handle glyph variants is a choice that must be made when constructing a particular character encoding. Some writing systems, such as Arabic and Hebrew, need to accommodate things like graphemes that are joined in different ways in different contexts, but represent
1596-606: A single character in EUC-TW can take up to four bytes. Modern applications are more likely to use UTF-8 , which supports all of the glyphs of the EUC codes, and more, and is generally more portable with fewer vendor deviations and errors. EUC is however still very popular, especially EUC-KR for South Korea. The structure of EUC is based on the ISO/IEC 2022 standard, which specifies a system of graphical character sets that can be represented with
1710-546: A single character per code unit. However, due to the emergence of more sophisticated character encodings, the distinction between these terms has become important. "Code page" is a historical name for a coded character set. Originally, a code page referred to a specific page number in the IBM standard character set manual, which would define a particular character encoding. Other vendors, including Microsoft , SAP , and Oracle Corporation , also published their own sets of code pages;
1824-426: A single decimal number. The double-byte codes are laid out in 94 numbered groups, each called a row ( 区 , ku , lit. "section") . Every row contains 94 numbered codes, each called a cell ( 点 , ten , lit. "point") . This makes a total of 8836 (94 × 94) possible code points (although not all are assigned, see below); these are laid out in the standard in a 94-line, 94-column code table. A row number and
1938-432: A stream of octets (bytes). The purpose of this decomposition is to establish a universal set of characters that can be encoded in a variety of ways. To describe this model precisely, Unicode uses its own set of terminology to describe its process: An abstract character repertoire (ACR) is the full set of abstract characters that a system supports. Unicode has an open repertoire, meaning that new characters will be added to
2052-584: A subset include the Mac OS Korean script (known as Code page 10003 or x-mac-korean ), which was used by HangulTalk (MacOS-KH), the Korean localization of the classic Mac OS . It was developed by Elex Computer ( 일렉스 ), who were at the time the authorised distributor of Apple Macintosh computers in South Korea. HangulTalk adds extension characters with lead bytes between 0xA1 and 0xAD, both in unused space within
2166-623: A subset of the characters used in written languages , sometimes restricted to upper case letters , numerals and some punctuation only. The advent of digital computer systems allows more elaborate encodings (such as Unicode ) to support hundreds of written languages. The most popular character encoding on the World Wide Web is UTF-8 , which is used in 98.2% of surveyed web sites, as of May 2024. In application programs and operating system tasks, both UTF-8 and UTF-16 are popular options. The history of character codes illustrates
2280-532: A variant form called HZ (which delimits GB 2312 text with ASCII sequences) was sometimes used on USENET . An ASCII character is represented in its usual encoding. A character from GB 2312 is represented by two bytes, both from the range 0xA1–0xFE. An encoding related to EUC-CN is the "748" code used in the WITS typesetting system developed by Beijing's Founder Technology (now obsoleted by its newer FITS typesetting system). The 748 code contains all of GB 2312 , but
2394-505: A well-defined and extensible encoding system, has replaced most earlier character encodings, but the path of code development to the present is fairly well known. The Baudot code, a five- bit encoding, was created by Émile Baudot in 1870, patented in 1874, modified by Donald Murray in 1901, and standardized by CCITT as International Telegraph Alphabet No. 2 (ITA2) in 1930. The name baudot has been erroneously applied to ITA2 and its many variants. ITA2 suffered from many shortcomings and
2508-423: Is backward compatible with fixed-length UCS-2BE and maps Unicode code points to variable-length sequences of 16-bit words. See comparison of Unicode encodings for a detailed discussion. Finally, there may be a higher-level protocol which supplies additional information to select the particular variant of a Unicode character, particularly where there are regional variants that have been 'unified' in Unicode as
2622-414: Is a variable-length encoding used to represent the elements of three Japanese character set standards , namely JIS X 0208 , JIS X 0212 , and JIS X 0201 . Other names for this encoding include Unixized JIS (or UJIS ) and AT&T JIS . As of September 2022, 0.1% of all web pages use EUC-JP, while 2.6% of websites written in Japanese use this second-most popular (for Japanese) encoding (which
2736-405: Is a variable-length encoding which may use up to four bytes per character, due to an even larger encoding space being required. Being an extension of GBK, it is a superset of EUC-CN but is not itself a true EUC code. Being a Unicode encoding, its repertoire is identical to that of other Unicode transformation formats such as UTF-8 . Other EUC-CN variants deviating from the EUC mechanism include
2850-578: Is a common extension. It is used (with minor variations, noted in footnotes) by Windows-932 (which is matched by the WHATWG Encoding Standard used by HTML5 ), by the PostScript variant (but, since KanjiTalk version 7, not the regular variant) of MacJapanese , and by JIS X 0213 (the successor to JIS X 0208). Unlike the other extensions made by Windows-932/WHATWG and JIS X 0213, the two match rather than colliding, so decoding of most of this row
2964-554: Is a different, unrelated, EUC-KR extension. Unified Hangul Code extends EUC-KR by using codes that do not conform to the EUC structure to incorporate additional syllable blocks, completing the coverage of the composed syllable blocks available in Johab and Unicode. The W3C / WHATWG Encoding Standard used by HTML5 incorporates the Unified Hangul Code extensions into its definition of EUC-KR. Other encodings incorporating EUC-KR as
3078-491: Is a variant of Shift JIS . HP-16 encodes JIS X 0208 using the same bytes as in EUC-JP, but does not use the single shift codes (thus omitting code sets 2 and 3), and adds three user-defined regions which do not follow the packed-format EUC structure: The IKIS (Interactive Kanji Information System) encoding used by Data General resembles EUC-JP without single shifts, i.e. with only code sets 0 and 1. Half-width katakana are instead included in row 8 of JIS X 0208 (colliding with
3192-436: Is also the structure used by CNS 11643 , and related to the structure used by CCCII . Among the 2-byte codes, rows 9 to 15 and 85 to 94 are unassigned code points ( 空き領域 , aki ryōiki ) ; that is, they are code points with no characters assigned to them. Some cells in other rows are also essentially unassigned code points. These empty areas contain code points that should basically not be used. Except when there
3306-538: Is also used in the Mainland Chinese GB 2312 , where it is natively known as 区位 ; qūwèi , and the South Korean KS C 5601 (currently KS X 1001 ), where the ku and ten are respectively known as hang ( 행 ; 行 ; haeng ) and yol ( 열 ; 列 ; yeol ). The later JIS X 0213 extends this structure by having more than one plane ( 面 , men , lit. "face") of rows, which
3420-447: Is better supported than the other extensions made by JIS X 0213. In order to represent code points , column/line numbers are used for one-byte codes and kuten numbers are used for two-byte codes. For a way to identify a character without depending on a code, character names are used. Almost all JIS X 0208 graphic character codes are represented with two bytes of at least seven bits each. However, every control character , as well as
3534-442: Is defined by a CEF. A character encoding scheme (CES) is the mapping of code units to a sequence of octets to facilitate storage on an octet-based file system or transmission over an octet-based network. Simple character encoding schemes include UTF-8 , UTF-16BE , UTF-32BE , UTF-16LE , and UTF-32LE ; compound character encoding schemes, such as UTF-16 , UTF-32 and ISO/IEC 2022 , switch between several simple schemes by using
JIS X 0208 - Misplaced Pages Continue
3648-444: Is defined by the encoding. Thus, the number of code units required to represent a code point depends on the encoding: Exactly what constitutes a character varies between character encodings. For example, for letters with diacritics , there are two distinct approaches that can be taken to encode them: they can be encoded either as a single unified character (known as a precomposed character), or as separate characters that combine into
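Unicode supports both approaches to diacritics, and the difference can be seen with Python's standard `unicodedata.normalize`. This is only a small illustration using "é" as an arbitrarily chosen example letter; the variable names are illustrative.

```python
import unicodedata

precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE, one code point
combining = "e\u0301"    # "e" followed by COMBINING ACUTE ACCENT

# The two strings render alike but differ code point by code point.
print(len(precomposed), len(combining), precomposed == combining)

# NFC composes where a precomposed form exists; NFD decomposes.
nfc = unicodedata.normalize("NFC", combining)
nfd = unicodedata.normalize("NFD", precomposed)
print(nfc == precomposed, nfd == combining)
```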
3762-496: Is different from the arrangement of katakana in JIS X 0201. In JIS X 0201, the syllabary starts with wo ( ヲ ) , followed by the small kana sorted by gojūon order, followed by the full-size kana, also in gojūon order ( ヲァィゥェォャュョッーアイウエオ......ラリルレロワン ). On the other hand, in JIS X 0208, the kana are sorted first by gojūon order, then in the order of "small kana, full-size kana, kana with dakuten, and kana with handakuten" such that
3876-423: Is due to these incompatibilities. Ever since the first standard, it has been possible to represent composites ( 合成 , gōsei ) such as encircled numbers , ligatures for measurement unit names, and Roman numerals ; they were not given independent kuten code points. Although individual companies that manufacture information systems can make an effort to represent these characters as customers may require by
3990-432: Is extended back to 0x59, out of which the lead bytes 0x81–A0 are designated for user-defined characters, and the remainder are used for corporate-defined characters, including both kanji and non-kanji. JEF (Japanese-processing Extended Feature) is an EBCDIC encoding used on Fujitsu FACOM mainframes, contrasting with FMR (a variant of Shift JIS) used on Fujitsu PCs. Like KEIS, JEF is a stateful encoding, switching to
4104-431: Is included in JIS X 0208, but the semantics of these terms vary from person to person. The first encoding byte corresponds to the row or cell number plus 0x20, or 32 in decimal (see below). Hence, the code set starting with 0x21 has a row number of 1, and its cell 1 has a continuation byte of 0x21 (or 33), and so forth. For lead bytes used for characters other than kanji , links are provided to charts on this page listing
4218-1122: Is more than for Shift JIS; both are much less used than UTF-8 ). It is called Code page 954 by IBM. Microsoft has two code page numbers for this encoding (51932 and 20932). This encoding scheme allows the easy mixing of 7-bit ASCII and 8-bit Japanese without the need for the escape characters employed by ISO-2022-JP , which is based on the same character set standards, and without ASCII bytes appearing as trail bytes (unlike Shift JIS ). A related and partially compatible encoding, called EUC-JISx0213 or EUC-JIS-2004 , encodes JIS X 0201 and JIS X 0213 (similarly to Shift_JISx0213 , its Shift_JIS-based counterpart). Compared to EUC-CN or EUC-KR, EUC-JP did not become as widely adopted on PC and Macintosh systems in Japan, which used Shift JIS or its extensions ( Windows code page 932 on Microsoft Windows , and MacJapanese on classic Mac OS ), although it became heavily used by Unix or Unix-like operating systems (except for HP-UX ). Therefore, whether Japanese websites use EUC-JP or Shift_JIS often depends on what OS
4332-524: Is not ISO 2022 –compliant and therefore not a true EUC code. (It uses an 8-bit lead byte but distinguishes between a second byte with its most significant bit set and one with its most significant bit cleared, and is, therefore, more similar in structure to Big5 and other non–ISO 2022–compliant DBCS encoding systems.) The non-GB2312 portion of the 748 code contains traditional and Hong Kong characters and other glyphs used in newspaper typesetting. IBM code page 1381 ( CCSID 1381) comprises
4446-430: Is not required to be left-padded with null bytes (similarly to the packed format). JIS X 0208 is, as usual, used for code set 1; code set 2 (half-width katakana) is absent; code set 3 is encoded like the two-byte fixed width format (i.e. without a shift byte and with only the first high bit set), but used for two-byte user defined characters rather than being specified for JIS X 0212. In the basic "DEC Kanji" encoding, only
4560-473: Is now preferred for new use, solving problems with consistency between platforms and vendors. A common extension of EUC-KR is the Unified Hangul Code ( 통합형 한글 코드 ; Tonghabhyeong Hangeul Kodeu , or 통합 완성형 ; Tonghab Wansunghyung ), which is the default Korean codepage on Microsoft Windows. It is given the code page number 949 by Microsoft, and 1261 or 1363 by IBM. IBM's code page 949
4674-476: Is often used to represent a yen sign in EUC-JP (see below) and a won sign in EUC-KR. The other code sets are invoked over GR (i.e. with the most significant bit set). Hence, to get the EUC form of a character, the most significant bit of each coding byte is set (equivalent to adding 128 to each 7-bit coding byte, or adding 160 to each number in the kuten code); this allows the software to easily distinguish whether
4788-430: Is preferred, usually in the larger context of locales. IBM's Character Data Representation Architecture (CDRA) designates entities with coded character set identifiers ( CCSIDs ), each of which is variously called a "charset", "character set", "code page", or "CHARMAP". The code unit size is equivalent to the bit measurement for the particular encoding: A code point is represented by a sequence of code units. The mapping
4902-405: Is prior agreement among the relevant parties, characters ( gaiji ) for information interchange should not be assigned to the unassigned code points. Even when assigning characters to unassigned code points, graphic characters defined in the standard should not be assigned to them, and the same character should not be assigned to multiple unassigned code points; characters should not be duplicated in
5016-438: Is represented as "16-01". In 7-bit JIS X 0208 (as might be switched to in JIS X 0202 / ISO-2022-JP ), both bytes must be from the 94-byte range of 0x21 (used for row or cell number 1) through 0x7E (used for row or cell number 94) – exactly corresponding to the range used for 7-bit ASCII printing characters, not counting the space. Accordingly, the encoded bytes are obtained by adding 0x20 (32) to each number. For instance,
5130-492: Is represented with either one 32-bit value (UTF-32), two 16-bit values (UTF-16), or four 8-bit values (UTF-8). Although each of those forms uses the same total number of bits (32) to represent the glyph, it is not obvious how the actual numeric byte values are related. As a result of having many character encoding methods in use (and the need for backward compatibility with archived data), many computer programs have been developed to translate data between character encoding schemes,
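The three representations of 𐐀 (U+10400) can be inspected with Python's built-in codecs. The big-endian variants are chosen here only to avoid a byte order mark in the output; the variable names are illustrative.

```python
ch = "\U00010400"  # DESERET CAPITAL LETTER LONG I

utf32 = ch.encode("utf-32-be")   # one 32-bit code unit
utf16 = ch.encode("utf-16-be")   # two 16-bit code units (a surrogate pair)
utf8 = ch.encode("utf-8")        # four 8-bit code units

for label, data, unit_size in (("UTF-32", utf32, 4),
                               ("UTF-16", utf16, 2),
                               ("UTF-8", utf8, 1)):
    # Each form occupies 4 bytes, but the byte values differ.
    print(label, len(data) // unit_size, "code unit(s):", data.hex(" "))
```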
5244-424: Is the inclusion of two extensions to the basic GB 2312-80 set in rows 6 and 8. These are considered "standard extensions to GB 2312", neither of which is proprietary to Apple: the row 8 extension was taken from GB 6345.1 , both extensions are included by GB/T 12345 (the traditional Chinese variant of GB 2312), and both extensions are included by GB 18030 (the successor to GB 2312). EUC-JP
5358-446: Is the process of assigning numbers to graphical characters , especially the written characters of human language, allowing them to be stored, transmitted, and transformed using computers. The numerical values that make up a character encoding are known as code points and collectively comprise a code space, a code page , or character map . Early character codes associated with the optical or electrical telegraph could only represent
5472-404: Is unused. Rows 101 through 148 are used for extended kanji, while rows 149 through 163 are used for extended non-kanji. EUC-KR is a variable-length encoding to represent Korean text using two coded character sets, KS X 1001 (formerly KS C 5601) and either ISO 646 :KR ( KS X 1003 , formerly KS C 5636 ) or ASCII , depending on variant. KS X 2901 (formerly KS C 5861 ) stipulates
5586-497: The Cyrillic script . Compare row 7 of GB 2312 , which matches this row. Compare and contrast row 12 of KS X 1001 and row 5 of KPS 9566 , which use the same layout (but in a different row). All characters in this set were added in 1983, and were not present in the original 1978 revision of the standard. Rows 9 through 15 of the JIS X 0208 standard are left empty. However, the following layout for row 13, first introduced by NEC ,
5700-623: The Halfwidth and Fullwidth Forms block if used in an encoding which combines JIS X 0208 with ASCII or with JIS X 0201, such as EUC-JP , Shift JIS or ISO 2022-JP . Compare row 3 of KPS 9566 , which this row exactly matches. Compare and contrast row 3 of KS X 1001 and of GB 2312 , which include their entire national variants of ISO 646 in this row, rather than only the alphanumeric subset. This row contains Japanese Hiragana . Compare row 4 of GB 2312 , which matches this row. Compare and contrast row 10 of KPS 9566 and of KS X 1001 , which use
5814-549: The ISO/IEC 8859 series technically conform to the EUC structure, they are rarely labeled as EUC. However, eucTH is used on Solaris as a label for TIS-620 . EUC-TW is a variable-length encoding that supports ASCII and 16 planes of CNS 11643 , each of which is 94×94. It is a rarely used encoding for traditional Chinese characters as used in Taiwan . Variants of Big5 are much more common than EUC-TW, although Big5 only encodes
#17328560863865928-469: The Mac OS Chinese Simplified script (known as Code page 10008 or x-mac-chinesesimp ). It uses the bytes 0x80, 0x81, 0x82, 0xA0, 0xFD, 0xFE, and 0xFF for the U with umlaut (ü), two special font metric characters, the non-breaking space , the copyright sign (©), the trademark sign (™) and the ellipsis (...) respectively. This differs in what is regarded as a single-byte character versus
6042-553: The final sigma . Compare row 6 of GB 2312 and GB 12345 and row 6 of KPS 9566 , which include the same Greek letters in the same layout, although GB 12345 adds vertical presentation forms and KPS 9566 adds Roman numerals. Compare and contrast row 5 of KS X 1001 , which offsets the Greek letters to include the Roman numerals first. This row contains the modern Russian alphabet and is not necessarily sufficient for representing other forms of
6156-482: The 1980s faced the dilemma that, on the one hand, it seemed necessary to add more bits to accommodate additional characters, but on the other hand, for the users of the relatively small character set of the Latin alphabet (who still constituted the majority of computer users), those additional bits were a colossal waste of then-scarce and expensive computing resources (as they would always be zeroed out for such users). In 1985,
6270-491: The 62 letters and numbers alone (e.g. 4/1 ("A") in ISO 646 becomes 2/3 4/1 (i.e. 3-33) in JIS X 0208). As to the cause of how these numerals, Latin letters, and so forth in the kanji set are the "full-width alphanumeric characters" ( 全角英数字 , zenkaku eisūji ) and how the original implementation came forth with a differing interpretation compared to the IRV, it is thought that it
6384-425: The 6349 characters of the first standard (1978). In the second and third standards, they added four and two characters to level 2, respectively, bringing the total kanji to 6355. Also, in the second standard, character forms were changed as well as transposition among the levels; in the third standard as well, character forms were changed. These are described further below. Character set Character encoding
6498-447: The EUC scheme. The G0 set is set to an ISO/IEC 646 compliant coded character set such as ASCII , ISO 646:KR ( KS X 1003 ) or ISO 646:JP (the lower half of JIS X 0201 ) and invoked over GL (i.e. 0x21–0x7E, with the most significant bit cleared). If ASCII is used, this makes the code an extended ASCII encoding; the most common deviation from ASCII is that 0x5C ( backslash in ASCII)
6612-568: The EUC-KR GR plane (trail bytes 0xA1–0xFE), and using non-EUC codes outside of it (trail bytes 0x41–0xA0). Some of these characters are font-style-independent stylized dingbats . Many of these characters do not have exact Unicode mappings, and Apple software maps these cases variously to combining sequences , to approximate mappings with an appended private-use character as a modifier for round-trip purposes, or to private-use characters. Apple also uses certain single-byte codes outside of
6726-400: The EUC-KR plane for additional characters: 0x80 for a required space , 0x81 for a won sign (₩), 0x82 for an en dash (–), 0x83 for a copyright sign (©), 0x84 for a wide underscore (_) and 0xFF for an ellipsis (...). Although none of these additional single-byte codes are within the lead byte range of plain EUC-KR (unlike Apple's extensions to EUC-CN, see above ), some are within
6840-534: The IBM-selected and user-defined characters. GBK is an extension to GB 2312 . It defines an extended form of the EUC-CN encoding capable of representing a larger array of CJK characters sourced largely from Unicode 1.1 , including traditional Chinese characters and characters used only in Japanese . It is not, however, a true EUC code, because ASCII bytes may appear as trail bytes (and C1 bytes , not limited to
6954-509: The Unicode standard is U+0000 to U+10FFFF, inclusive, divided in 17 planes , identified by the numbers 0 to 16. Characters in the range U+0000 to U+FFFF are in plane 0, called the Basic Multilingual Plane (BMP). This plane contains the most commonly-used characters. Characters in the range U+10000 to U+10FFFF in the other planes are called supplementary characters . The following table shows examples of code point values: Consider
#17328560863867068-424: The above example of 16-01 ("亜") would be represented by the bytes 0x30 0x21 . The 8-bit EUC-JP instead uses the range 0xA1 through 0xFE (setting the high bit to 1), whereas other encodings such as Shift JIS use more complicated transforms. Shift JIS includes more encoding space than is needed for JIS X 0208 itself; some Shift JIS specific extensions to JIS X 0208 make use of row numbers above 94. This structure
7182-425: The adoption of electrical and electro-mechanical techniques these earliest codes were adapted to the new capabilities and limitations of the early machines. The earliest well-known electrically transmitted character code, Morse code , introduced in the 1840s, used a system of four "symbols" (short signal, long signal, short space, long space) to generate codes of variable length. Though some commercial use of Morse code
7296-443: The announcement and designation sequences from ISO 2022 . However, the code specification is equivalent to the following sequence of four ISO 2022 announcement sequences, with meanings breaking down as follows. The ISO-2022-based variable-length encoding described above is sometimes referred to as the EUC packed format , which is the encoding format usually labeled as EUC. However, internal processing of EUC data may make use of
7410-500: The author uses. Characters are encoded as follows: Vendor extensions to EUC-JP (from, for example, the Open Software Foundation , IBM or NEC ) were often allocated within the individual code sets, as opposed to using invalid EUC sequences (as in popular extensions of EUC-CN and EUC-KR). However, some vendor-specific encodings are partially compatible with EUC-JP, due to encoding JIS X 0208 over GR, but do not follow
7524-464: The average personal computer user's hard disk drive could store only about 10 megabytes, and it cost approximately US$ 250 on the wholesale market (and much higher if purchased separately at retail), so it was very important at the time to make every bit count. The compromise solution that was eventually found and developed into Unicode was to break the assumption (dating back to telegraph codes) that each character should always directly correspond to
7638-418: The box-drawing characters added to the standard in 1983). JIS X 0208 rows 9 through 12 are used for user-defined characters. KEIS (Kanji-processing Extended Information System) is an EBCDIC encoding used by Hitachi , with double-byte characters (a DBCS-Host encoding) included using shifting sequences, making it a stateful encoding. Specifically, the sequence 0x0A 0x41 switches to single-byte mode and
7752-768: The character at ISO/IEC 646 International Reference Version ( US-ASCII ) column 4 line 1 and the one at JIS X 0208 row 3 cell 33 have the name "LATIN CAPITAL LETTER A". Therefore, the character at 4/1 in ASCII and the character at 3-33 in JIS X 0208 can be regarded as the same character (although, in practice, alternative mapping is used for the JIS X 0208 character due to encodings providing ASCII separately). Conversely, ASCII characters 2/2 (quotation mark), 2/7 (apostrophe), 2/13 (hyphen-minus), and 7/14 (tilde) can be determined to be characters that do not exist in this standard. Character names of non-kanji characters use uppercase Roman letters, spaces, and hyphens. Non-kanji characters are given
7866-603: The character set are not considered compatible. Because there are places where such things have happened as the original drafting committee of the first standard taking care to separate characters between level 1 and level 2 and the second standard then shuffling some variant characters (異体字, itaiji ) between the levels, at least in the first and second standards, it is conjectured that non- kanji and level 1-only implementation Japanese computer systems were at one time considered for development. However, such implementations have never been specified as compatible, though examples such as
7980-533: The characters encoded under that lead byte. For lead bytes used for kanji, links are provided to the appropriate section of Wiktionary 's kanji index. Some vendors use slightly different Unicode mapping for this set than the one below. For example, Microsoft maps kuten 1-29 (JIS 0x213D) to U+2015 (Horizontal Bar), whereas Apple maps it to U+2014 (Em Dash). Similarly, Microsoft maps kuten 1-61 (JIS 0x215D) to U+FF0D (the fullwidth form of U+002D Hyphen-Minus), and Apple maps it to U+2212 (Minus Sign). Unicode mapping of
8094-543: The codes unassigned in JIS X 0208 are assigned by the newer JIS X 0213 standard. Each JIS X 0208 character is given a name . By using a character's name, it is possible to identify characters without relying on their codes. The names of characters are coordinated with other character set standards, notably the Universal Coded Character Set (UCS/ Unicode ), so this is one possible source of character mappings to character sets such as Unicode. For example, both
8208-437: The column number. Four low-order bits counting from zero to fifteen form the line number. Each decimal number corresponds to one hexadecimal digit. For example, the bit combination corresponding to the graphic character "space" is 010 0000 as a 7-bit number, and 0010 0000 as an 8-bit number. In column/line notation, this is represented as 2/0. Other representations of the same single-byte code include 0x20 as hexadecimal, or 32 as
8322-537: The composition of the characters, none has requested to have them added to the standard, instead choosing to proprietarily offer them as gaiji . In the fourth standard (1997), all these characters were explicitly defined as characters that accompany an advancement of the current position; that is to say, they are spacing characters . Furthermore, it was ruled that they should not be made by the composition of characters. For this reason, it became disallowed to represent Latin characters with diacritics at all, with possibly
8436-691: The double-byte component as Code page 971 , and to EUC-KR with ASCII as Code page 970 . It is implemented as Code page 20949 ("Korean Wansung") and Code page 51949 ("EUC Korean") by Microsoft. As of April 2024 , less than 0.08% of all web pages globally use EUC-KR, but 4.6% of South Korean web pages use EUC-KR, Including extensions, it is the most widely used legacy character encoding in Korea on all three major platforms ( macOS , other Unix-like OSes, and Windows), but its use has been very slowly shifting to UTF-8 as it gains popularity, especially on Linux and macOS. As with most other encodings, UTF-8
8550-500: The early NEC PC-9801 did exist. Even though there are provisions in the JIS X 0208:1997 standard concerning compatibility, at the present time, it is generally considered that this standard neither certifies compatibility nor is it an official manufacturing standard that amounts to a declaration of self-compatibility. Consequently, de facto , JIS X 0208-"compatible" products are not considered to exist. Terminology such as "conformant" ( 準拠 , junkyo ) and "support" ( 対応 , taiō )
8664-557: The encoding and RFC 1557 dubbed it as EUC-KR. A character drawn from KS X 1001 (G1, code set 1) is encoded as two bytes in GR (0xA1–0xFE) and a character from KS X 1003 or ASCII (G0, code set 0) takes one byte in GL (0x21–0x7E). It is usually referred to as Wansung ( Korean : 완성 ; RR : Wanseong ; lit. precomposed ) in the Republic of Korea . IBM refers to
8778-536: The era had their own character codes, often six-bit, but usually had the ability to read tapes produced on IBM equipment. These BCD encodings were the precursors of IBM's Extended Binary-Coded Decimal Interchange Code (usually abbreviated as EBCDIC), an eight-bit encoding scheme developed in 1963 for the IBM System/360 that featured a larger character set, including lower case letters. In trying to develop universally interchangeable character encodings, researchers in
8892-405: The evolving need for machine-mediated character-based symbolic information over a distance, using once-novel electrical means. The earliest codes were based upon manual and hand-written encoding and cyphering systems, such as Bacon's cipher , Braille , international maritime signal flags , and the 4-digit encoding of Chinese characters for a Chinese telegraph code ( Hans Schjellerup , 1869). With
9006-602: The first 31 rows of code set 3 are used for user-defined characters: rows 32 through 94 are reserved, similarly to the unused rows in code set 1. The "Super DEC Kanji" encoding accepts codes both from the "DEC Kanji" encoding and from packed-format EUC, for a total of five code-sets. It also allows the entire user defined code set, and the unused rows at the ends of the JIS X 0208 and JIS X 0212 code sets (rows 85–94 and 78–94 respectively), to be used for user-defined characters. Hewlett-Packard defines an encoding referred to as "HP-16". This accompanies their "HP-15" encoding, which
9120-402: The first byte of a two-byte character from both EUC (where, of those, 0xFD and 0xFE are defined as lead bytes) and GBK (where, of those, 0x81, 0x82, 0xFD and 0xFE are defined as lead bytes). This use of 0xA0, 0xFD, 0xFE and 0xFF matches Apple's Shift_JIS variant . Besides these changes to the lead byte range, the other distinctive feature of the double-byte portion of Mac OS Chinese Simplified
9234-514: The first line of the chart below), which were included in the original 1978 version of the standard. This set includes a subset of the ISO 646 invariant set (and therefore also a subset of both ASCII and the JIS X 0201 Roman set), minus punctuation and symbols, comprising western Arabic numerals and both cases of the Basic Latin alphabet . Characters in this set may use alternative Unicode mappings to
9348-497: The four characters. This means that the kanji set is the most widespread non-upward-compatible character set in the world; it is counted as one of the weak points of this standard. Even with the 90 special characters, numerals, and Latin letters the kanji set and the IRV set have in common, this standard does not follow the arrangement of ISO/IEC 646. These 90 characters are split between rows 1 (punctuation) and 3 (letters and numbers), although row 3 does follow ISO 646 arrangement for
9462-406: The inclusion of 96-character sets. The ranges 0x00–1F and 0x80–9F are used for C0 and C1 control codes . EUC is a family of 8-bit profiles of ISO/IEC 2022 , as opposed to 7-bit profiles such as ISO-2022-JP . As such, only ISO 2022 compliant character sets can have EUC forms. Up to four coded character sets (referred to as G0, G1, G2, and G3 or as code sets 0, 1, 2, and 3) can be represented with
9576-404: The incompatibility with the katakana of this standard. This point is also one of the weaknesses of this standard. How the kanji in this standard were chosen from what sources, why they are split into level 1 and level 2, and how they are arranged are all explained in detail in the fourth standard (1997). Per that explanation, the kanji included in the following four kanji listings were reflected in
9690-576: The kanji set. In the following table, the ISO/IEC 646:1991 IRV characters in question are compared with their multiple equivalents in JIS X 0208, except for the IRV character "TILDE", which is compared with the "WAVE DASH" of JIS X 0208. The entries under the "Symbol" columns utilize UCS/Unicode code points, so the specifics of display may differ. The ASCII/IRV characters without exact JIS X 0208 equivalents were later assigned code points by JIS X 0213 , these are also listed below, as are Microsoft's mapping of
9804-504: The lead byte range of Unified Hangul Code (specifically, 0x81, 0x82, 0x83 and 0x84). Similarly to KS X 1001, the North Korean KPS 9566 standard is typically used in EUC form; in these contexts, it is sometimes referred to as EUC-KP. More recent editions of the standard extend the EUC representation with characters using non-EUC two-byte codes, in a similar manner to Unified Hangul Code. Although certain single-byte encodings such as
9918-475: The most well-known code page suites are " Windows " (based on Windows-1252) and "IBM"/"DOS" (based on code page 437 ). Despite no longer referring to specific page numbers in a standard, many character encodings are still referred to by their code page number; likewise, the term "code page" is often still used to refer to character encodings in general. The term "code page" is not used in Unix or Linux, where "charmap"
10032-467: The name of it would be "CJK UNIFIED IDEOGRAPH-4E9C". Kanji are not given Japanese common names. JIS X 0208 prescribes a set of 6879 graphical characters that correspond to two-byte codes with either seven or eight bits to the byte; in JIS X 0208, this is called the kanji set ( 漢字集合 , kanji shūgō ) , which includes 6355 kanji as well as 524 non-kanji ( 非漢字 , hikanji ) , including characters such as Latin letters , kana , and so forth. As for
10146-486: The packed EUC structure. Often, these do not include use of the single shifts from EUC-JP, and are thus not straight extensions of EUC-JP, with the exception of Super DEC Kanji. Digital Equipment Corporation defines two variants of EUC-JP only partly conforming to the EUC packed format, but also bearing some resemblance to the complete two-byte format. The overall format of the "DEC Kanji" encoding mostly corresponds to fixed-length (complete two-byte) EUC; however, code set 0
10260-478: The packed format as "EUC-JP" or "csEUCPkdFmtJapanese" and the fixed width format as "csEUCFixWidJapanese". Only the packed format is included in the WHATWG Encoding Standard used by HTML5 . EUC-CN is the usual encoded form of the GB 2312 standard for simplified Chinese characters . Unlike the case of Japanese JIS X 0208 and ISO-2022-JP , GB 2312 is not normally used in a 7-bit ISO 2022 code version, although
10374-422: The plain space – although not the ideographic space – is represented with a one-byte code. In order to represent the bit combination ( ビット組合せ , bitto kumiawase ) of a one-byte code, two decimal numbers – a column number and a line number – are used. Three high-order bits out of seven or four high-order bits out of eight, counting from zero to seven or from zero to fifteen respectively, form
10488-412: The punched card code then in use only allowed digits, upper-case English letters and a few special characters, six bits were sufficient. These BCD encodings extended existing simple four-bit numeric encoding to include alphabetic and special characters, mapping them easily to punch-card encoding which was already in widespread use. IBM's codes were used primarily with IBM equipment; other computer vendors of
10602-460: The repertoire over time. A coded character set (CCS) is a function that maps characters to code points (each code point represents one character). For example, in a given repertoire, the capital letter "A" in the Latin alphabet might be represented by the code point 65, the character "B" by 66, and so on. Multiple coded character sets may share the same character repertoire; for example ISO/IEC 8859-1 and IBM code pages 037 and 500 all cover
10716-491: The same character. An example is the XML attribute xml:lang. The Unicode model uses the term "character map" for other systems which directly assign a sequence of characters to a sequence of bytes, covering all of the CCS, CEF and CES layers. In Unicode, a character can be referred to as 'U+' followed by its codepoint value in hexadecimal. The range of valid code points (the codespace) for
10830-403: The same code point. Consequently, limiting point 25-66 to the "mouth" form and assigning the latter "ladder" form to an unassigned code point would technically be in violation of the standard. In practice, however, several vendor-specific Shift JIS variants, including Windows-932 and MacJapanese , encode vendor extensions in unallocated rows of the encoding space for JIS X 0208. Also, most of
10944-445: The same fundamental kana is grouped with its derivatives ( ぁあぃいぅうぇえぉお......っつづ......はばぱひびぴふぶぷへべぺほぼぽ......ゎわゐゑをん ). This ordering was chosen in order to more simply facilitate the sorting of kana-based dictionary look-ups (Yasuoka, 2006). As mentioned above, in this standard, the previously defined katakana order in JIS X 0201 was not followed in JIS X 0208. It is thought that the JIS X 0201 katakana being " half-width kana " arose due to
11058-411: The same layout, but in a different row. This row contains Japanese Katakana . Compare row 5 of GB 2312 , which matches this row. Compare and contrast row 11 of KPS 9566 and of KS X 1001 , which use the same layout, but in a different row. Contrast the considerably different Katakana layout used by JIS X 0201 . This row contains basic support for the modern Greek alphabet , without diacritics or
11172-537: The same repertoire but map them to different code points. A character encoding form (CEF) is the mapping of code points to code units to facilitate storage in a system that represents numbers as bit sequences of fixed length (i.e. practically any computer system). For example, a system that stores numeric information in 16-bit units can only directly represent code points 0 to 65,535 in each unit, but larger code points (say, 65,536 to 1.4 million) could be represented by using multiple 16-bit units. This correspondence
11286-472: The same semantic character. Unicode and its parallel standard, the ISO/IEC 10646 Universal Character Set , together constitute a unified standard for character encoding. Rather than mapping characters directly to bytes , Unicode separately defines a coded character set that maps characters to unique natural numbers ( code points ), how those code points are mapped to a series of fixed-size natural numbers (code units), and finally how those units are encoded as
11400-499: The sequence 0x0A 0x42 switches to double-byte mode. However, JIS X 0208 characters are encoded using the same byte sequences used to encode them in EUC-JP. This results in duplicate encodings for the ideographic space —0x4040 per the DBCS-Host code structure, and 0xA1A1 as in EUC-JP. This differs from IBM's DBCS-Host encoding for Japanese, the layout of which builds on versions which predate JIS X 0208 altogether. The lead byte range
11514-437: The set. Furthermore, when assigning characters to unassigned code points, it is necessary to be cautious of unification in regards to kanji glyphs. For example, row 25 cell 66 corresponds to the kanji meaning "high" or "expensive"; both the form with a component resembling the "mouth" character ( 口 ) in the middle ( 高 ) and the less common form with a ladder-like construction in the same location ( 髙 ) are subsumed into
11628-477: The single shifts, may appear as lead or trail bytes), due to a larger encoding space being required. Variants of GBK are implemented by Windows code page 936 (the Microsoft Windows code page for simplified Chinese), and by IBM's code page 1386. The Unicode-based GB 18030 character encoding defines an extension of GBK capable of encoding the entirety of Unicode . However, Unicode encoded as GB 18030
11742-433: The single-byte code page 1115 (CPGID 1115 as CCSID 1115) and the double-byte code page 1380 (CPGID 1380 as CCSID 1380), which encodes GB 2312 the same way as EUC-CN, but deviates from the EUC structure by extending the lead byte range back to 0x8C, adding 31 IBM-selected characters in 0x8CE0 through 0x8CFE and adding 1880 user-defined characters with lead bytes 0x8D through 0xA0. IBM code page 1383 (CCSID 1383) comprises
11856-463: The single-byte code page 367 and the double-byte code page 1382 (CPGID 1382 as CCSID 1382), which differs by conforming to the EUC structure, adding the 31 IBM-selected characters in 0xFEE0 through 0xFEFE instead, and including only 1360 user-defined characters, interspersed in the positions not used by GB 2312. The alternative CCSID 5479 is used for the pure EUC-CN code page: it uses CCSID 9574 as its double-byte set, which uses CPGID 1382 but excludes
11970-409: The sole exception of the ångström symbol ( Å ) at row 2 cell 82. The hiragana and katakana in JIS X 0208, unlike JIS X 0201 , include dakuten and handakuten markings as part of a character. The katakana wi ( ヰ ) and we ( ヱ ) (both obsolete in modern Japanese) as well as the small wa ( ヮ ) , not in JIS X 0201, are also included. The arrangement of kana in JIS X 0208
12084-433: The solution was to implement variable-length encodings where an escape sequence would signal that subsequent bits should be parsed as a higher code point. Informally, the terms "character encoding", "character map", "character set" and "code page" are often used interchangeably. Historically, the same standard would specify a repertoire of characters and how they were to be encoded into a stream of code units — usually with
12198-572: The special characters in the kanji set, some characters from the graphic character set of the International Reference Version (IRV) of ISO/IEC 646 :1991 (equivalent to ASCII ) are absent from JIS X 0208. There are the aforementioned four characters "QUOTATION MARK", "APOSTROPHE", "HYPHEN-MINUS", and "TILDE". The former three are split into different code points in the kanji set (Nishimura, 1978; JIS X 0221-1:2001 standard, Section 3.8.7). The "TILDE" of IRV has no corresponding character in
12312-484: The wave dash also differs between vendors. See the cells with footnotes below. ASCII and JISCII punctuation (shown here with a yellow background) may use alternative mappings to the Halfwidth and Fullwidth Forms block if used in an encoding which combines JIS X 0208 with ASCII or with JIS X 0201 , such as Shift JIS , EUC-JP or ISO 2022-JP . Most of the characters in this set were added in 1983, except for characters 0x2221–0x222E (kuten 2-1 through 2-14, or
12426-504: Was adopted fairly widely. ASCII67's American-centric nature was somewhat addressed in the European ECMA-6 standard. Herman Hollerith invented punch card data encoding in the late 19th century to analyze census data. Initially, each hole position represented a different data element, but later, numeric information was encoded by numbering the lower rows 0 to 9, with a punch in a column representing its row number. Later alphabetic data
12540-667: Was encoded by allowing more than one punch per column. Electromechanical tabulating machines represented date internally by the timing of pulses relative to the motion of the cards through the machine. When IBM went to electronic processing, starting with the IBM 603 Electronic Multiplier, it used a variety of binary encoding schemes that were tied to the punch card code. IBM used several Binary Coded Decimal ( BCD ) six-bit character encoding schemes, starting as early as 1953 in its 702 and 704 computers, and in its later 7000 Series and 1400 series , as well as in associated peripherals. Since
12654-409: Was often improved by many equipment manufacturers, sometimes creating compatibility issues. In 1959 the U.S. military defined its Fieldata code, a six-or seven-bit code, introduced by the U.S. Army Signal Corps. While Fieldata addressed many of the then-modern issues (e.g. letter and digit codes arranged for machine collation), it fell short of its goals and was short-lived. In 1963 the first ASCII code
12768-540: Was originally established as JIS C 6226 in 1978, and has been revised in 1983, 1990, and 1997. It is also called Code page 952 by IBM. The 1978 version is also called Code page 955 by IBM. The character set JIS X 0208 establishes is primarily for the purpose of information interchange ( 情報交換 , jōhō kōkan ) between data processing systems and the devices connected to them, or mutually between data communication systems. This character set can be used for data processing and text processing. Partial implementations of
12882-534: Was released (X3.4-1963) by the ASCII committee (which contained at least one member of the Fieldata committee, W. F. Leubbert), which addressed most of the shortcomings of Fieldata, using a simpler code. Many of the changes were subtle, such as collatable character sets within certain numeric ranges. ASCII63 was a success, widely adopted by industry, and with the follow-up issue of the 1967 ASCII code (which added lower-case letters and fixed some "control code" issues) ASCII67
12996-576: Was via machinery, it was often used as a manual code, generated by hand on a telegraph key and decipherable by ear, and persists in amateur radio and aeronautical use. Most codes are of fixed per-character length or variable-length sequences of fixed-length codes (e.g. Unicode ). Common examples of character encoding systems include Morse code, the Baudot code , the American Standard Code for Information Interchange (ASCII) and Unicode. Unicode,