Extended Unix Code ( EUC ) is a multibyte character encoding system used primarily for Japanese , Korean , and simplified Chinese (characters) .
116-874: ISO/IEC 2022 Information technology—Character code structure and extension techniques , is an ISO / IEC standard in the field of character encoding . It is equivalent to the ECMA standard ECMA-35 , the ANSI standard ANSI X3.41 and the Japanese Industrial Standard JIS X 0202 . Originating in 1971, it was most recently revised in 1994. ISO 2022 specifies a general structure which character encodings can conform to, dedicating particular ranges of bytes ( 0x 00–1F and 0x7F–9F) to be used for non-printing control codes for formatting and in-band instructions (such as line breaks or formatting instructions for text terminals ), rather than graphical characters . It also specifies
232-507: A 7-bit or 8-bit environment), but not both. Which style of C1 invocation is used must be specified in the definition of the code version. For example, ISO/IEC 4873 specifies CR bytes for the C1 controls which it uses (SS2 and SS3). If necessary, which invocation is used may be communicated using announcer sequences . In the latter case, single control functions from the C1 control code set are invoked using "type Fe" escape sequences, meaning those where
348-449: A 94×94 coded character set (such as GB 2312 ) represented in two bytes. The EUC-CN form of GB 2312 and EUC-KR are examples of such two-byte EUC codes. EUC-JP includes characters represented by up to three bytes, including an initial shift code , whereas a single character in EUC-TW can take up to four bytes. Modern applications are more likely to use UTF-8 , which supports all of
464-415: A collaboration agreement that allow "key industry players to negotiate in an open workshop environment" outside of ISO in a way that may eventually lead to development of an ISO standard. EUC-JP The most commonly used EUC codes are variable-length encodings with a character belonging to an ISO/IEC 646 compliant coded character set (such as ASCII ) taking one byte, and a character belonging to
580-533: A document is submitted directly for approval as a draft International Standard (DIS) to the ISO member bodies or as a final draft International Standard (FDIS), if the document was developed by an international standardizing body recognized by the ISO Council. The first step, a proposal of work (New Proposal), is approved at the relevant subcommittee or technical committee (e.g., SC 29 and JTC 1 respectively in
696-486: A double-byte DBCS-Host mode using shifting sequences (where 0x29 switches to single-byte mode and 0x28 switches to double-byte mode). Also similarly to KEIS, JIS X 0208 codes are represented the same as in EUC-JP. The lead byte range is extended back to 0x41, with 0x80–0xA0 designated for user definition; lead bytes 0x41–0x7F are assigned row numbers 101 through 163 for kuten purposes, although row 162 (lead byte 0x7E)
812-512: A graphical set designation sequence, if the second I byte (for a single-byte set) or the third I byte (for a double-byte set) is 0x20 (space), the set denoted is a " dynamically redefinable character set " (DRCS) defined by prior agreement, which is also considered private use. A graphical set being considered a DRCS implies that it represents a font of exact glyphs, rather than a set of abstract characters. The manner in which DRCS sets and associated fonts are transmitted, allocated and managed
928-555: A line. Furthermore, the escape sequences declaring the national character sets may be absent if a specific ISO-2022-based encoding permits or requires this, and dictates that particular national character sets are to be used. For example, ISO-8859-1 states that no defining escape sequence is needed. To represent large character sets, ISO/IEC 2022 builds on ISO/IEC 646 's property that a seven-bit character representation will normally be able to represent 94 graphic (printable) characters (in addition to space and 33 control characters); if only
1044-442: A long process that commonly starts with the proposal of new work within a committee. Some abbreviations used for marking a standard with its status are: Abbreviations used for amendments are: Other abbreviations are: International Standards are developed by ISO technical committees (TC) and subcommittees (SC) by a process with six steps: The TC/SC may set up working groups (WG) of experts for
1160-533: A proposal to form a new global standards body. In October 1946, ISA and UNSCC delegates from 25 countries met in London and agreed to join forces to create the International Organization for Standardization. The organization officially began operations on 23 February 1947. ISO Standards were originally known as ISO Recommendations ( ISO/R ), e.g., " ISO 1 " was issued in 1951 as "ISO/R 1". ISO
1276-436: A relatively small number of standards, ISO standards are not available free of charge, but rather for a purchase fee, which has been seen by some as unaffordable for small open-source projects. The process of developing standards within ISO was criticized around 2007 as being too difficult for timely completion of large and complex standards, and some members were failing to respond to ballots, causing problems in completing
SECTION 10
#17330857521881392-447: A single byte, regardless of the number of bytes used for graphical characters. CJK encodings used in 7-bit environments which use ISO 2022 mechanisms to switch between character sets are often given names starting with "ISO-2022-", most notably ISO-2022-JP , although some other CJK encodings such as EUC-JP also make use of ISO 2022 mechanisms. Since the first 256 code points of Unicode were taken from ISO 8859-1 , Unicode inherits
1508-568: A standard document; however, registration does not create a new ISO standard, does not commit the ISO or IEC to adopt it as an international standard, and does not commit the ISO or IEC to add any of its characters to the Universal Coded Character Set . ISO-IR registered escape sequences are also used encapsulated in a Formal Public Identifier to identify character sets used for numeric character references in SGML (ISO 8879). For example,
1624-584: A subset include the Mac OS Korean script (known as Code page 10003 or x-mac-korean ), which was used by HangulTalk (MacOS-KH), the Korean localization of the classic Mac OS . It was developed by Elex Computer ( 일렉스 ), who were at the time the authorised distributor of Apple Macintosh computers in South Korea. HangulTalk adds extension characters with lead bytes between 0xA1 and 0xAD, both in unused space within
1740-559: A syntax for escape sequences, multiple-byte sequences beginning with the ESC control code, which can likewise be used for in-band instructions. Specific sets of control codes and escape sequences designed to be used with ISO 2022 include ISO/IEC 6429 , portions of which are implemented by ANSI.SYS and terminal emulators . ISO 2022 itself also defines particular control codes and escape sequences which can be used for switching between different coded character sets (for example, between ASCII and
1856-637: Is "to develop worldwide Information and Communication Technology (ICT) standards for business and consumer applications." There was previously also a JTC 2 that was created in 2009 for a joint project to establish common terminology for "standardization in the field of energy efficiency and renewable energy sources". It was later disbanded. As of 2022 , there are 167 national members representing ISO in their country, with each country having only one member. ISO has three membership categories, Participating members are called "P" members, as opposed to observing members, who are called "O" members. ISO
1972-414: Is a variable-length encoding used to represent the elements of three Japanese character set standards , namely JIS X 0208 , JIS X 0212 , and JIS X 0201 . Other names for this encoding include Unixized JIS (or UJIS ) and AT&T JIS . 0.1% of all web pages use EUC-JP since September 2022, while 2.6% of websites written with Japanese use this second-most popular (for Japanese) encoding (which
2088-451: Is a variable-length encoding which may use up to four bytes per character, due to an even larger encoding space being required. Being an extension of GBK, it is a superset of EUC-CN but is not itself a true EUC code. Being a Unicode encoding, its repertoire is identical to that of other Unicode transformation formats such as UTF-8 . Other EUC-CN variants deviating from the EUC mechanism include
2204-554: Is a different, unrelated, EUC-KR extension. Unified Hangul Code extends EUC-KR by using codes that do not conform to the EUC structure to incorporate additional syllable blocks, completing the coverage of the composed syllable blocks available in Johab and Unicode. The W3C / WHATWG Encoding Standard used by HTML5 incorporates the Unified Hangul Code extensions into its definition of EUC-KR. Other encodings incorporating EUC-KR as
2320-516: Is a family of 8-bit profiles of ISO/IEC 2022 , as opposed to 7-bit profiles such as ISO-2022-JP . As such, only ISO 2022 compliant character sets can have EUC forms. Up to four coded character sets (referred to as G0, G1, G2, and G3 or as code sets 0, 1, 2, and 3) can be represented with the EUC scheme. The G0 set is set to an ISO/IEC 646 compliant coded character set such as ASCII , ISO 646:KR ( KS X 1003 ) or ISO 646:JP (the lower half of JIS X 0201 ) and invoked over GL (i.e. 0x21–0x7E, with
2436-491: Is a variant of Shift JIS . HP-16 encodes JIS X 0208 using the same bytes as in EUC-JP, but does not use the single shift codes (thus omitting code sets 2 and 3), and adds three user-defined regions which do not follow the packed-format EUC structure: The IKIS (Interactive Kanji Information System) encoding used by Data General resembles EUC-JP without single shifts, i.e. with only code sets 0 and 1. Half-width katakana are instead included in row 8 of JIS X 0208 (colliding with
SECTION 20
#17330857521882552-462: Is a voluntary organization whose members are recognized authorities on standards, each one representing one country. Members meet annually at a General Assembly to discuss the strategic objectives of ISO. The organization is coordinated by a central secretariat based in Geneva . A council with a rotating membership of 20 member bodies provides guidance and governance, including setting the annual budget of
2668-464: Is abused, ISO should halt the process... ISO is an engineering old boys club and these things are boring so you have to have a lot of passion ... then suddenly you have an investment of a lot of money and lobbying and you get artificial results. The process is not set up to deal with intensive corporate lobbying and so you end up with something being a standard that is not clear. International Workshop Agreements (IWAs) are documents that establish
2784-508: Is an abbreviation for "International Standardization Organization" or a similar title in another language, the letters do not officially represent an acronym or initialism . The organization provides this explanation of the name: Because 'International Organization for Standardization' would have different acronyms in different languages (IOS in English, OIN in French), our founders decided to give it
2900-971: Is an independent, non-governmental , international standard development organization composed of representatives from the national standards organizations of member countries. Membership requirements are given in Article 3 of the ISO Statutes. ISO was founded on 23 February 1947, and (as of July 2024 ) it has published over 25,000 international standards covering almost all aspects of technology and manufacturing. It has over 800 technical committees (TCs) and subcommittees (SCs) to take care of standards development. The organization develops and publishes international standards in technical and nontechnical fields, including everything from manufactured products and technology to food safety, transport, IT, agriculture, and healthcare. More specialized topics like electrical and electronic engineering are instead handled by
3016-512: Is approved as an International Standard (IS) if a two-thirds majority of the P-members of the TC/SC is in favour and not more than one-quarter of the total number of votes cast are negative. After approval, the document is published by the ISO central secretariat , with only minor editorial changes introduced in the publication process before the publication as an International Standard. Except for
3132-488: Is available. This allows for sets of 94 graphical characters, or 8836 (94 ) characters, or 830584 (94 ) characters. Although initially 0x20 and 0x7F were always the space and delete character and 0xA0 and 0xFF were unused, later editions of ISO/IEC 2022 allowed the use of the bytes 0xA0 and 0xFF (or 0x20 and 0x7F) within sets under certain circumstances, allowing the inclusion of 96-character sets. The ranges 0x00–1F and 0x80–9F are used for C0 and C1 control codes . EUC
3248-432: Is extended back to 0x59, out of which the lead bytes 0x81–A0 are designated for user-defined characters, and the remainder are used for corporate-defined characters, including both kanji and non-kanji. JEF (Japanese-processing Extended Feature) is an EBCDIC encoding used on Fujitsu FACOM mainframes, contrasting with FMR (a variant of Shift JIS) used on Fujitsu PCs. Like KEIS, JEF is a stateful encoding, switching to
3364-522: Is funded by a combination of: International standards are the main products of ISO. It also publishes technical reports, technical specifications, publicly available specifications, technical corrigenda (corrections), and guides. International standards Technical reports For example: Technical and publicly available specifications For example: Technical corrigenda ISO guides For example: ISO documents have strict copyright restrictions and ISO charges for most copies. As of 2020 ,
3480-643: Is in turn conformed to by ISO/IEC 8859 , and Extended Unix Code , which is used for East Asian languages. More specialised applications of ISO 2022 include the MARC-8 encoding system used in MARC 21 library records. The escape sequences for switching to particular character sets or encodings are registered with the ISO-IR registry (except for those set apart for private use, the meanings of which are defined by vendors, or by protocol specifications such as ARIB STD-B24 ) and follow
3596-1122: Is more than for Shift JIS both are much less used that UTF-8 ). It is called Code page 954 by IBM. Microsoft has two code page numbers for this encoding (51932 and 20932). This encoding scheme allows the easy mixing of 7-bit ASCII and 8-bit Japanese without the need for the escape characters employed by ISO-2022-JP , which is based on the same character set standards, and without ASCII bytes appearing as trail bytes (unlike Shift JIS ). A related and partially compatible encoding, called EUC-JISx0213 or EUC-JIS-2004 , encodes JIS X 0201 and JIS X 0213 (similarly to Shift_JISx0213 , its Shift_JIS-based counterpart). Compared to EUC-CN or EUC-KR, EUC-JP did not become as widely adopted on PC and Macintosh systems in Japan, which used Shift JIS or its extensions ( Windows code page 932 on Microsoft Windows , and MacJapanese on classic Mac OS ), although it became heavily used by Unix or Unix-like operating systems (except for HP-UX ). Therefore, whether Japanese websites use EUC-JP or Shift_JIS often depends on what OS
ISO/IEC 2022 - Misplaced Pages Continue
3712-524: Is not ISO 2022 –compliant and therefore not a true EUC code. (It uses an 8-bit lead byte but distinguishes between a second byte with its most significant bit set and one with its most significant bit cleared, and is, therefore, more similar in structure to Big5 and other non–ISO 2022–compliant DBCS encoding systems.) The non-GB2312 portion of the 748 code contains traditional and Hong Kong characters and other glyphs used in newspaper typesetting. IBM code page 1381 ( CCSID 1381) comprises
3828-599: Is not normally used in a 7-bit ISO 2022 code version, although a variant form called HZ (which delimits GB 2312 text with ASCII sequences) was sometimes used on USENET . An ASCII character is represented in its usual encoding. A character from GB 2312 is represented by two bytes, both from the range 0xA1–0xFE. An encoding related to EUC-CN is the "748" code used in the WITS typesetting system developed by Beijing's Founder Technology (now obsoleted by its newer FITS typesetting system). The 748 code contains all of GB 2312 , but
3944-430: Is not required to be left-padded with null bytes (similarly to the packed format). JIS X 0208 is, as usual, used for code set 1; code set 2 (half-width katakana) is absent; code set 3 is encoded like the two-byte fixed width format (i.e. without a shift byte and with only the first high bit set), but used for two-byte user defined characters rather than being specified for JIS X 0212. In the basic "DEC Kanji" encoding, only
4060-417: Is not stipulated by ISO/IEC 2022 / ECMA-35 itself, although it recommends allocating them sequentially starting with F byte 0x40 ( @ ); however, a manner for transmitting DRCS fonts is defined within some telecommunication protocols such as World System Teletext . There are also three special cases for multi-byte codes. The code sequences ESC $ @ , ESC $ A , and ESC $ B were all registered when
4176-473: Is now preferred for new use, solving problems with consistency between platforms and vendors. A common extension of EUC-KR is the Unified Hangul Code ( 통합형 한글 코드 ; Tonghabhyeong Hangeul Kodeu , or 통합 완성형 ; Tonghab Wansunghyung ), which is the default Korean codepage on Microsoft Windows. It is given the code page number 949 by Microsoft, and 1261 or 1363 by IBM. IBM's code page 949
4292-425: Is produced, for example, for audio and video coding standards is called a verification model (VM) (previously also called a "simulation and test model"). When a sufficient confidence in the stability of the standard under development is reached, a working draft (WD) is produced. This is in the form of a standard, but is kept internal to working group for revision. When a working draft is sufficiently mature and
4408-568: Is registered with the IANA in both formats, the packed format as "EUC-JP" or "csEUCPkdFmtJapanese" and the fixed width format as "csEUCFixWidJapanese". Only the packed format is included in the WHATWG Encoding Standard used by HTML5 . EUC-CN is the usual encoded form of the GB 2312 standard for simplified Chinese characters . Unlike the case of Japanese JIS X 0208 and ISO-2022-JP , GB 2312
4524-551: Is required that any C0 character set include the ESC character at position 0x1B, so that further changes are possible. The control set designation sequences (as opposed to the graphical set ones) may also be used from within ISO/IEC 10646 (UCS/Unicode), in contexts where processing ANSI escape codes is appropriate, provided that each byte in the sequence is padded to the code unit size of the encoding. A table of escape sequence I bytes and
4640-590: Is restricted. The organization that is known today as ISO began in 1926 as the International Federation of the National Standardizing Associations ( ISA ), which primarily focused on mechanical engineering . The ISA was suspended in 1942 during World War II but, after the war, the ISA was approached by the recently-formed United Nations Standards Coordinating Committee (UNSCC) with
4756-451: Is set (equivalent to adding 128 to each 7-bit coding byte, or adding 160 to each number in the kuten code); this allows the software to easily distinguish whether a particular byte in a character string belongs to the ISO 646 code or the extended code. Characters in code sets 2 and 3 are prefixed with the control codes SS2 (0x8E) and SS3 (0x8F) respectively, and invoked over GR. Besides
ISO/IEC 2022 - Misplaced Pages Continue
4872-570: Is sometimes referred to as the EUC packed format , which is the encoding format usually labeled as EUC. However, internal processing of EUC data may make use of a fixed-length transformation format called the EUC complete two-byte format . This represents: Initial bytes of 0x00 and 0x80 are used in cases where the code set uses only one byte. There is also a four-byte fixed-length format. These fixed-length encoding formats are suited to internal processing and are not usually encountered in interchange. EUC-JP
4988-424: Is the inclusion of two extensions to the basic GB 2312-80 set in rows 6 and 8. These are considered "standard extensions to GB 2312", neither of which is proprietary to Apple: the row 8 extension was taken from GB 6345.1 , both extensions are included by GB/T 12345 (the traditional Chinese variant of GB 2312), and both extensions are included by GB 18030 (the successor to GB 2312). EUC-JP
5104-404: Is unused. Rows 101 through 148 are used for extended kanji, while rows 149 through 163 are used for extended non-kanji. EUC-KR is a variable-length encoding to represent Korean text using two coded character sets, KS X 1001 (formerly KS C 5601) and either ISO 646 :KR ( KS X 1003 , formerly KS C 5636 ) or ASCII , depending on variant. KS X 2901 (formerly KS C 5861 ) stipulates
5220-676: The English alphabet ), and does not provide good support for languages which use additional letters, or which use a different writing system altogether. Other writing systems with relatively few characters, such as Greek , Cyrillic , Arabic or Hebrew , as well as forms of the Latin script using diacritics or letters absent from the ISO Basic Latin alphabet, have historically been represented on personal computers with different 8- bit , single byte , extended ASCII encodings, which follow ASCII when
5336-500: The ISO/IEC 8859 series technically conform to the EUC structure, they are rarely labeled as EUC. However, eucTH is used on Solaris as a label for TIS-620 . EUC-TW is a variable-length encoding that supports ASCII and 16 planes of CNS 11643 , each of which is 94×94. It is a rarely used encoding for traditional Chinese characters as used in Taiwan . Variants of Big5 are much more common than EUC-TW, although Big5 only encodes
5452-630: The International Electrotechnical Commission . It is headquartered in Geneva , Switzerland. The three official languages of ISO are English , French , and Russian . The International Organization for Standardization in French is Organisation internationale de normalisation and in Russian, Международная организация по стандартизации ( Mezhdunarodnaya organizatsiya po standartizatsii ). Although one might think ISO
5568-469: The Mac OS Chinese Simplified script (known as Code page 10008 or x-mac-chinesesimp ). It uses the bytes 0x80, 0x81, 0x82, 0xA0, 0xFD, 0xFE, and 0xFF for the U with umlaut (ü), two special font metric characters, the non-breaking space , the copyright sign (©), the trademark sign (™) and the ellipsis (...) respectively. This differs in what is regarded as a single-byte character versus
5684-531: The VT100 , and are thus supported by terminal emulators . By default, GL codes specify G0 characters and GR codes (where available) specify G1 characters; this may be otherwise specified by prior agreement. The set invoked over each area may also be modified with control codes referred to as shifts, as shown in the table below. An 8-bit code may have GR codes specifying G1 characters, i.e. with its corresponding 7-bit code using Shift In and Shift Out to switch between
5800-560: The most significant bit is 0 (i.e. bytes 0x00–7F, when represented in hexadecimal ), and include additional characters for a most significant bit of 1 (i.e. bytes 0x80–FF). Some of these, such as the ISO 8859 series, conform to ISO 2022, while others such as DOS code page 437 do not, usually due to not reserving the bytes 0x80–9F for control codes. Certain East Asian languages, specifically Chinese , Japanese , and Korean (collectively " CJK "), are written using far more characters than
5916-489: The 0x20/A0 and 0x7F/FF bytes are actually assigned by the set; some examples of graphical character sets which are registered as 96-sets but do not use those bytes include the G1 set of I.S. 434 , the box drawing set from ISO/IEC 10367 , and ISO-IR-164 (a subset of the G1 set of ISO-8859-8 with only the letters, used by CCITT ). Characters are expected to be spacing characters, not combining characters, unless specified otherwise by
SECTION 50
#17330857521886032-451: The C0 control codes (narrowly defined) are excluded, this can be expanded to 96 characters. Using two bytes, it is thus possible to represent up to 8,836 (94×94) characters; and, using three bytes, up to 830,584 (94×94×94) characters. Though the standard defines it, no registered character set uses three bytes (although EUC-TW 's unregistered G2 does, as does the similarly unregistered CCCII ). For
6148-574: The C0 set, besides the ten included by ISO 6429 / ECMA-48 (namely SOH, STX, ETX, EOT, ENQ, ACK, DLE, NAK, SYN and ETB), or inclusion of any of those ten in the C1 set, is also prohibited by the ISO/IEC 2022 / ECMA-35 standard. A C0 control set is invoked over the CL range 0x00 through 0x1F, whereas a C1 control function may be invoked over the CR range 0x80 through 0x9F (in an 8-bit environment) or by using escape sequences (in
6264-464: The CR range always either invokes the secondary (C1) controls or is unused. The delete character DEL (0x7F), the escape character ESC (0x1B) and the space character SP (0x20) are designated "fixed" coded characters and are always available when G0 is invoked over GL, irrespective of what character sets are designated. They may not be included in graphical character sets, although other sizes or types of whitespace character may be. Sequences using
6380-445: The ESC (escape) character take the form ESC [ I ...] F , where the ESC character is followed by zero or more intermediate bytes ( I ) from the range 0x20–0x2F, and one final byte ( F ) from the range 0x30–0x7E. The first I byte, or absence thereof, determines the type of escape sequence; it might, for instance, designate a working set, or denote a single control function. In all types of escape sequences, F bytes in
6496-557: The ESC (escape) control character at 0x1B (a C0 set containing only ESC is registered as ISO-IR-104), whereas a C1 control set may not contain the escape control whatsoever. Hence, they are entirely separate registrations, with a C0 set being only a C0 set and a C1 set being only a C1 set. If codes from the C0 set of ISO 6429 / ECMA-48, i.e. the ASCII control codes , appear in the C0 set, they are required to appear at their ISO 6429 / ECMA-48 locations. Inclusion of transmission control characters in
6612-538: The ESC control character is followed by a byte from columns 04 or 05 (that is to say, ESC 0x40 (@) through ESC 0x5F (_) ). Additional control functions are assigned to "type Fs" escape sequences (in the range ESC 0x60 (`) through ESC 0x7E (~) ); these have permanently assigned meanings rather than depending on the C0 or C1 designations. Registration of control functions to type "Fs" sequences must be approved by ISO/IEC JTC 1/SC 2 . Other single control functions may be registered to type "3Ft" escape sequences (in
6728-568: The EUC-KR GR plane (trail bytes 0xA1–0xFE), and using non-EUC codes outside of it (trail bytes 0x41–0xA0). Some of these characters are font-style-independent stylized dingbats . Many of these characters do not have exact Unicode mappings, and Apple software maps these cases variously to combining sequences , to approximate mappings with an appended private-use character as a modifier for round-trip purposes, or to private-use characters. Apple also uses certain single-byte codes outside of
6844-449: The EUC-KR plane for additional characters: 0x80 for a required space , 0x81 for a won sign (₩), 0x82 for an en dash (–), 0x83 for a copyright sign (©), 0x84 for a wide underscore (_) and 0xFF for an ellipsis (...). Although none of these additional single-byte codes are within the lead byte range of plain EUC-KR (unlike Apple's extensions to EUC-CN, see above ), some are within
6960-584: The IBM-selected and user-defined characters. GBK is an extension to GB 2312 . It defines an extended form of the EUC-CN encoding capable of representing a larger array of CJK characters sourced largely from Unicode 1.1 , including traditional Chinese characters and characters used only in Japanese . It is not, however, a true EUC code, because ASCII bytes may appear as trail bytes (and C1 bytes , not limited to
7076-556: The ISO-IR registry is specified by ISO/IEC 2375 . Each registration receives a unique escape sequence, and a unique registry entry number to identify it. For example, the CCITT character set for Simplified Chinese is known as ISO-IR-165 . Registration of coded character sets with the ISO-IR registry identifies the documents specifying the character set or control function associated with an ISO/IEC 2022 non‑private-use escape sequence. This may be
SECTION 60
#17330857521887192-433: The ISO/IEC 2022 / ECMA-35 standard itself. They may be described elsewhere using hexadecimal , as is often used in this article, or using the corresponding ASCII characters, although the escape sequences are actually defined in terms of byte values, and the graphic assigned to that byte value may be altered without affecting the control sequence. Byte values from the 7-bit ASCII graphic range (hexadecimal 0x20–0x7F), being on
7308-549: The Japanese JIS X 0208 ) so as to use multiple in a single document, effectively combining them into a single stateful encoding (a feature less important since the advent of Unicode ). It is designed to be usable in both 8-bit environments and 7-bit environments (those where only seven bits are usable in a byte, such as e-mail without 8BITMIME ). The ASCII character set supports the ISO Basic Latin alphabet (equivalent to
7424-500: The author uses. Characters are encoded as follows: Vendor extensions to EUC-JP (from, for example, the Open Software Foundation , IBM or NEC ) were often allocated within the individual code sets, as opposed to using invalid EUC sequences (as in popular extensions of EUC-CN and EUC-KR). However, some vendor-specific encodings are partially compatible with EUC-JP, due to encoding JIS X 0208 over GR, but do not follow
7540-536: The basis that it leaves the graphical character repertoire undefined. ISO/IEC 4873 / ECMA-43 does, however, permit the use of the GCC function provided that the sequence of characters is kept the same and merely displayed in one space, rather than being over-stamped to form a character with a different meaning. Control character sets are classified as "primary" or "secondary" control code sets, respectively also called "C0" and "C1" control code sets. A C0 control set must contain
7656-418: The box-drawing characters added to the standard in 1983). JIS X 0208 rows 9 through 12 are used for user-defined characters. KEIS (Kanji-processing Extended Information System) is an EBCDIC encoding used by Hitachi , with double-byte characters (a DBCS-Host encoding) included using shifting sequences, making it a stateful encoding. Specifically, the sequence 0x0A 0x41 switches to single-byte mode and
7772-479: The case of MPEG, the Moving Picture Experts Group ). A working group (WG) of experts is typically set up by the subcommittee for the preparation of a working draft (e.g., MPEG is a collection of seven working groups as of 2023). When the scope of a new work is sufficiently clarified, some of the working groups may make an open request for proposals—known as a "call for proposals". The first document that
7888-418: The central secretariat. The technical management board is responsible for more than 250 technical committees , who develop the ISO standards. ISO has a joint technical committee (JTC) with the International Electrotechnical Commission (IEC) to develop standards relating to information technology (IT). Known as JTC 1 and entitled "Information technology", it was created in 1987 and its mission
8004-501: The concept of C0 and C1 control codes from ISO 2022, although it adds other non-printing characters besides the ISO 2022 control codes. However, Unicode transformation formats such as UTF-8 generally deviate from the ISO 2022 structure in various ways, including: ISO 2022 escape sequences do, however, exist for switching to and from UTF-8 as a " coding system different from that of ISO 2022 ", which are supported by certain terminal emulators such as xterm . ISO/IEC 2022 specifies
8120-421: The confidence people have in the standards setting process", and alleged that ISO did not carry out its responsibility. He also said that Microsoft had intensely lobbied many countries that traditionally had not participated in ISO and stacked technical committees with Microsoft employees, solution providers, and resellers sympathetic to Office Open XML: When you have a process built on trust and when that trust
8236-509: The contemporary version of the standard allowed multi-byte sets only in G0, so must be accepted in place of the sequences ESC $ ( @ through ESC $ ( B to designate to the G0 character set. There are additional (rarely used) features for switching control character sets, but this is a single-level lookup, in that (as noted above) the C0 set is always invoked over CL, and the C1 set is always invoked over CR or by using escape codes. As noted above, it
8352-432: The designation or other function which they perform is below. Note that the registry of F bytes is independent for the different types. The 94-character graphic set designated by ESC ( A through ESC + A is not related in any way to the 96-character set designated by ESC - A through ESC / A . And neither of those is related to the 94-character set designated by ESC $ ( A through ESC $ + A , and so on;
8468-413: The document, the draft is then approved for submission as a Final Draft International Standard (FDIS) if a two-thirds majority of the P-members of the TC/SC are in favour and if not more than one-quarter of the total number of votes cast are negative. ISO will then hold a ballot among the national bodies where no technical changes are allowed (a yes/no final approval ballot), within a period of two months. It
8584-691: The double-byte component as Code page 971 , and to EUC-KR with ASCII as Code page 970 . It is implemented as Code page 20949 ("Korean Wansung") and Code page 51949 ("EUC Korean") by Microsoft. As of April 2024 , less than 0.08% of all web pages globally use EUC-KR, but 4.6% of South Korean web pages use EUC-KR, Including extensions, it is the most widely used legacy character encoding in Korea on all three major platforms ( macOS , other Unix-like OSes, and Windows), but its use has been very slowly shifting to UTF-8 as it gains popularity, especially on Linux and macOS. As with most other encodings, UTF-8
8700-557: The encoding and RFC 1557 dubbed it as EUC-KR. A character drawn from KS X 1001 (G1, code set 1) is encoded as two bytes in GR (0xA1–0xFE) and a character from KS X 1003 or ASCII (G0, code set 0) takes one byte in GL (0x21–0x7E). It is usually referred to as Wansung ( Korean : 완성 ; RR : Wanseong ; lit. precomposed ) in the Republic of Korea . IBM refers to
8816-513: The escape sequences listed below, whereas the others are part of a C0 or C1 control code set (as shown below, SI (LS0) and SO (LS1) are C0 controls and SS2 and SS3 are C1 controls), meaning that their coding and availability may vary depending on which control sets are designated: they must be present in the designated control sets if their functionality is used. The C1 controls themselves, as mentioned above, may be represented using escape sequences or 8-bit bytes, but not both. Alternative encodings of
8932-576: The final bytes must be interpreted in context. (Indeed, without any intermediate bytes, ESC A is a way of specifying the C1 control code 0x81.) International Organization for Standardization Early research and development: Merging the networks and creating the Internet: Commercialization, privatization, broader access leads to the modern Internet: Examples of Internet services: The International Organization for Standardization ( ISO / ˈ aɪ s oʊ / )
9048-648: The first 31 rows of code set 3 are used for user-defined characters: rows 32 through 94 are reserved, similarly to the unused rows in code set 1. The "Super DEC Kanji" encoding accepts codes both from the "DEC Kanji" encoding and from packed-format EUC, for a total of five code-sets. It also allows the entire user defined code set, and the unused rows at the ends of the JIS X 0208 and JIS X 0212 code sets (rows 85–94 and 78–94 respectively), to be used for user-defined characters. Hewlett-Packard defines an encoding referred to as "HP-16". This accompanies their "HP-15" encoding, which
9164-402: The first byte of a two-byte character from both EUC (where, of those, 0xFD and 0xFE are defined as lead bytes) and GBK (where, of those, 0x81, 0x82, 0xFD and 0xFE are defined as lead bytes). This use of 0xA0, 0xFD, 0xFE and 0xFF matches Apple's Shift_JIS variant . Besides these changes to the lead byte range, the other distinctive feature of the double-byte portion of Mac OS Chinese Simplified
9280-674: The following: A specific implementation does not have to implement all of the standard; the conformance level and the supported character sets are defined by the implementation. Although many of the mechanisms defined by the ISO/IEC 2022 standard are infrequently used, several established encodings are based on a subset of the ISO/IEC 2022 system. In particular, 7-bit encoding systems using ISO/IEC 2022 mechanisms include ISO-2022-JP (or JIS encoding ), which has primarily been used in Japanese-language e-mail . 8-bit encoding systems conforming to ISO/IEC 2022 include ISO/IEC 4873 (ECMA-43), which
9396-485: The form ESC ( ! F have been assigned. At the other extreme, no multibyte 96-sets have been registered, so the sequences below are strictly theoretical. As with other escape sequence types, the range 0x30–0x3F is reserved for private-use F bytes, in this case for private-use character set definitions (which might include unregistered sets defined by protocols such as ARIB STD-B24 or MARC-8 , or vendor-specific sets such as DEC Special Graphics ). However, in
9512-464: The glyphs of the EUC codes, and more, and is generally more portable with fewer vendor deviations and errors. EUC is however still very popular, especially EUC-KR for South Korea. The structure of EUC is based on the ISO/IEC 2022 standard, which specifies a system of graphical character sets that can be represented with a sequence of the 94 7-bit bytes 0x 21–7E, or alternatively 0xA1–FE if an eighth bit
9628-483: The graphical set in question. ISO 2022 / ECMA-35 also recognizes the use of the backspace and carriage return control characters as means of combining otherwise spacing characters, as well as the CSI sequence "Graphic Character Combination" (GCC) ( CSI 0x20 (SP) 0x5F (_) ). Use of the backspace and carriage return in this manner is permitted by ISO/IEC 646 but prohibited by ISO/IEC 4873 / ECMA-43 and by ISO/IEC 8859 , on
9744-465: The initial shift code, any byte outside of the range 0xA0–0xFF appearing in a character from code sets 1 through 3 is not a valid EUC code. The EUC code itself does not make use of the announcement and designation sequences from ISO 2022 . However, the code specification is equivalent to the following sequence of four ISO 2022 announcement sequences, with meanings breaking down as follows. The ISO-2022-based variable-length encoding described above
9860-504: The lead byte range of Unified Hangul Code (specifically, 0x81, 0x82, 0x83 and 0x84). Similarly to KS X 1001, the North Korean KPS 9566 standard is typically used in EUC form; in these contexts, it is sometimes referred to as EUC-KP. More recent editions of the standard extend the EUC representation with characters using non-EUC two-byte codes, in a similar manner to Unified Hangul Code. Although certain single-byte encodings such as
9976-418: The left side of a character code table, are referred to as "GL" codes (with "GL" standing for "graphics left") while bytes from the "high ASCII" range (0xA0–0xFF), if available (i.e. in an 8-bit environment), are referred to as the "GR" codes ("graphics right") . The terms "CL" (0x00–0x1F) and "CR" (0x80–0x9F) are defined for the control ranges, but the CL range always invokes the primary (C0) controls, whereas
10092-520: The maximum of 256 which can be represented in a single byte, and were first represented on computers with language-specific double-byte encodings or variable-width encodings ; some of these (such as the Simplified Chinese encoding GB 2312 ) conform to ISO 2022 , while others (such as the Traditional Chinese encoding Big5 ) do not. Control codes in ISO 2022 are always represented with
10208-486: The most significant bit cleared). If ASCII is used, this makes the code an extended ASCII encoding; the most common deviation from ASCII is that 0x5C ( backslash in ASCII) is often used to represent a yen sign in EUC-JP (see below) and a won sign in EUC-KR. The other code sets are invoked over GR (i.e. with the most significant bit set). Hence, to get the EUC form of a character, the most significant bit of each coding byte
10324-708: The necessary steps within the prescribed time limits. In some cases, alternative processes have been used to develop standards outside of ISO and then submit them for its approval. A more rapid "fast-track" approval procedure was used in ISO/IEC JTC 1 for the standardization of Office Open XML (OOXML, ISO/IEC 29500, approved in April 2008), and another rapid alternative "publicly available specification" (PAS) process had been used by OASIS to obtain approval of OpenDocument as an ISO/IEC standard (ISO/IEC 26300, approved in May 2006). As
10440-489: The next stage, called the "enquiry stage". After a consensus to proceed is established, the subcommittee will produce a draft international standard (DIS), and the text is submitted to national bodies for voting and comment within a period of five months. A document in the DIS stage is available to the public for purchase and may be referred to with its ISO DIS reference number. Following consideration of any comments and revision of
10556-486: The packed EUC structure. Often, these do not include use of the single shifts from EUC-JP, and are thus not straight extensions of EUC-JP, with the exception of Super DEC Kanji. Digital Equipment Corporation defines two variants of EUC-JP only partly conforming to the EUC packed format, but also bearing some resemblance to the complete two-byte format. The overall format of the "DEC Kanji" encoding mostly corresponds to fixed-length (complete two-byte) EUC; however, code set 0
10672-408: The patterns defined within the standard. Character encodings making use of these escape sequences require data to be processed sequentially in a forward direction, since the correct interpretation of the data depends on previously encountered escape sequences. Specific profiles such as ISO-2022-JP may impose extra conditions, such as that the current character set is reset to US-ASCII before the end of
10788-411: The preparation of a working drafts. Subcommittees may have several working groups, which may have several Sub Groups (SG). It is possible to omit certain stages, if there is a document with a certain degree of maturity at the start of a standardization project, for example, a standard developed by another organization. ISO/IEC directives also allow the so-called "Fast-track procedure". In this procedure,
10904-762: The range ESC 0x23 (#) [ I ...] 0x40 (@) through ESC 0x23 (#) [ I ...] 0x7E (~) ), although no "3Ft" sequences are currently assigned (as of 2019). Some of these are specified in ECMA-35 (ISO 2022 / ANSI X3.41), others in ECMA-48 (ISO 6429 / ANSI X3.64). ECMA-48 refers to these as "independent control functions". Escape sequences of type "Fp" ( ESC 0x30 (0) through ESC 0x3F (?) ) or of type "3Fp" ( ESC 0x23 (#) [ I ...] 0x30 (0) through ESC 0x23 (#) [ I ...] 0x3F (?) ) are reserved for single private use control codes, by prior agreement between parties. Several such sequences of both types are used by DEC terminals such as
11020-442: The range 0x20–0x2F, then by a single byte in the range 0x40–0x7E, the entire sequence being called a "control sequence". Each of the four working sets G0 through G3 may be a 94-character set or a 94-character multi-byte set . Additionally, G1 through G3 may be a 96- or 96-character set. In a 96- or 96-character set, the bytes 0x20 through 0x7F when GL-invoked, or 0xA0 through 0xFF when GR-invoked, are allocated to and may be used by
11136-426: The range 0x30–0x3F are reserved for unregistered private uses defined by prior agreement between parties. Control functions from some sets may make use of further bytes following the escape sequence proper. For example, the ISO 6429 control function " Control Sequence Introducer ", which can be represented using an escape sequence, is followed by zero or more bytes in the range 0x30–0x3F, then zero or more bytes in
11252-667: The same pair of C0 control characters (0x0F and 0x0E) as the names "shift in" (SI) and "shift out" (SO). However, the standard refers to them as LS0 and LS1 when they are used in 8-bit environments and as SI and SO when they are used in 7-bit environments. The ISO/IEC 2022 / ECMA-35 standard permits, but discourages, invoking G1, G2 or G3 in both GL and GR simultaneously. The ISO International register of coded character sets to be used with escape sequences (ISO-IR) lists graphical character sets, control code sets, single control codes and so forth which have been registered for use with ISO/IEC 2022. The procedure for registering codes and sets with
11368-499: The sequence 0x0A 0x42 switches to double-byte mode. However, JIS X 0208 characters are encoded using the same byte sequences used to encode them in EUC-JP. This results in duplicate encodings for the ideographic space —0x4040 per the DBCS-Host code structure, and 0xA1A1 as in EUC-JP. This differs from IBM's DBCS-Host encoding for Japanese, the layout of which builds on versions which predate JIS X 0208 altogether. The lead byte range
11484-439: The set is single-byte or multi-byte (although not how many bytes it uses if it is multi-byte), and also whether each byte has 94 or 96 permitted values. ISO/IEC 2022 coding specifies a two-layer mapping between character codes and displayed characters. Escape sequences allow any of a large registry of graphic character sets to be "designated" into one of four working sets, named G0 through G3, and shorter control sequences specify
11600-402: The set. In a 94- or 94-character set, the bytes 0x20 and 0x7F are not used. When a 96- or 96-character set is invoked in the GL region, the space and delete characters (codes 0x20 and 0x7F) are not available until a 94- or 94-character set (such as the G0 set) is invoked in GL. 96-character sets cannot be designated to G0. Registration of a set as a 96-character set does not necessarily mean that
11716-422: The sets (e.g. JIS X 0201 ), although some instead have GR codes specifying G2 characters, with the corresponding 7-bit code using a single-shift code to access the second set (e.g. T.51 ). The codes shown in the table below are the most common encodings of these control codes, conforming to ISO/IEC 6429 . The LS2, LS3, LS1R, LS2R and LS3R shifts are registered as single control functions and are always encoded as
11832-468: The short form ISO . ISO is derived from the Greek word isos ( ίσος , meaning "equal"). Whatever the country, whatever the language, the short form of our name is always ISO . During the founding meetings of the new organization, however, the Greek word explanation was not invoked, so this meaning may be a false etymology . Both the name ISO and the ISO logo are registered trademarks and their use
11948-477: The single shifts, may appear as lead or trail bytes), due to a larger encoding space being required. Variants of GBK are implemented by Windows code page 936 (the Microsoft Windows code page for simplified Chinese), and by IBM's code page 1386. The Unicode-based GB 18030 character encoding defines an extension of GBK capable of encoding the entirety of Unicode . However, Unicode encoded as GB 18030
12064-433: The single-byte code page 1115 (CPGID 1115 as CCSID 1115) and the double-byte code page 1380 (CPGID 1380 as CCSID 1380), which encodes GB 2312 the same way as EUC-CN, but deviates from the EUC structure by extending the lead byte range back to 0x8C, adding 31 IBM-selected characters in 0x8CE0 through 0x8CFE and adding 1880 user-defined characters with lead bytes 0x8D through 0xA0. IBM code page 1383 (CCSID 1383) comprises
12180-463: The single-byte code page 367 and the double-byte code page 1382 (CPGID 1382 as CCSID 1382), which differs by conforming to the EUC structure, adding the 31 IBM-selected characters in 0xFEE0 through 0xFEFE instead, and including only 1360 user-defined characters, interspersed in the positions not used by GB 2312. The alternative CCSID 5479 is used for the pure EUC-CN code page: it uses CCSID 9574 as its double-byte set, which uses CPGID 1382 but excludes
12296-400: The single-shift area. This must be specified in the definition of the code version. For instance, ISO/IEC 4873 specifies GL, whereas packed EUC specifies GR. In 7-bit environments, only GL is used as the single-shift area. If necessary, which single-shift area is used may be communicated using announcer sequences . The names "locking shift zero" (LS0) and "locking shift one" (LS1) refer to
12412-461: The single-shifts as C0 control codes are available in certain control code sets. For example, SS2 and SS3 are usually available at 0x19 and 0x1D respectively in T.51 and T.61 . This coding is currently recommended by ISO/IEC 2022 / ECMA-35 for applications requiring 7-bit single-byte representations of SS2 and SS3, and may also be used for SS2 only, although older code sets with SS2 at 0x1C also exist, and were mentioned as such in an earlier edition of
12528-517: The standard. The 0x8E and 0x8F coding of the single shifts as shown below is mandatory for ISO/IEC 4873 levels 2 and 3. Although officially considered shift codes and named accordingly, single-shift codes are not always viewed as shifts, and they may simply be viewed as prefix bytes (i.e. the first bytes in a multi-byte sequence), since they do not require the encoder to keep the currently active set as state , unlike locking shift codes. In 8-bit environments, either GL or GR, but not both, may be used as
12644-523: The string ISO 646-1983//CHARSET International Reference Version (IRV)//ESC 2/5 4/0 can be used to identify the International Reference Version of ISO 646 -1983, and the HTML 4.01 specification uses ISO Registration Number 177//CHARSET ISO/IEC 10646-1:1993 UCS-4 with implementation level 3//ESC 2/5 2/15 4/6 to identify Unicode. The textual representation of the escape sequence, included in
12760-501: The subcommittee is satisfied that it has developed an appropriate technical document for the problem being addressed, it becomes a committee draft (CD) and is sent to the P-member national bodies of the SC for the collection of formal comments. Revisions may be made in response to the comments, and successive committee drafts may be produced and circulated until consensus is reached to proceed to
12876-454: The third element of the FPI, will be recognised by SGML implementations for supported character sets. Escape sequences to designate character sets take the form ESC I [ I ...] F . As mentioned above, the intermediate ( I ) bytes are from the range 0x20–0x2F, and the final ( F ) byte is from the range 0x30–0x7E. The first I byte (or, for a multi-byte set, the first two) identifies
12992-419: The two-byte character sets, the code point of each character is normally specified in so-called row-cell or kuten form, which comprises two numbers between 1 and 94 inclusive, specifying a row and cell of that character within the zone. For a three-byte set, an additional plane number is included at the beginning. The escape sequences do not only declare which character set is being used, but also whether
13108-423: The type of character set and the working set it is to be designated to, whereas the F byte (and any additional I bytes) identify the character set itself, as assigned in the ISO-IR register (or, for the private-use escape sequences, by prior agreement). Additional I bytes may be added before the F byte to extend the F byte range. This is currently only used with 94-character sets, where codes of
13224-414: The typical cost of a copy of an ISO standard is about US$ 120 or more (and electronic copies typically have a single-user license, so they cannot be shared among groups of people). Some standards by ISO and its official U.S. representative (and, via the U.S. National Committee, the International Electrotechnical Commission ) are made freely available. A standard published by ISO/IEC is the last stage of
13340-408: The working set that is "invoked" to interpret bytes in the stream. Encoding byte values ("bit combinations") are often given in column-line notation , where two decimal numbers in the range 00–15 (each corresponding to a single hexadecimal digit) are separated by a slash. Hence, for instance, codes 2/0 (0x20) through 2/15 (0x2F) inclusive may be referred to as "column 02". This is the notation used in
13456-513: Was suggested at the time by Martin Bryan, the outgoing convenor (chairman) of working group 1 (WG1) of ISO/IEC JTC 1/SC 34 , the rules of ISO were eventually tightened so that participating members that fail to respond to votes are demoted to observer status. The computer security entrepreneur and Ubuntu founder, Mark Shuttleworth , was quoted in a ZDNet blog article in 2008 about the process of standardization of OOXML as saying: "I think it de-values
#187812