A CCSID (coded character set identifier) is a 16-bit number that represents a particular encoding of a specific code page. For example, Unicode is a code page that has several character encoding schemes (referred to as "transformation formats"), including UTF-8, UTF-16 and UTF-32, but which may or may not actually be accompanied by a CCSID number to indicate that this encoding is being used.
The terms code page and CCSID are often used interchangeably, even though they are not synonymous. A code page may be only part of what makes up a CCSID. The following definitions from IBM help to illustrate this point: The following examples show how some CCSIDs are made up of other CCSIDs. All three of these variant Shift-JIS CCSIDs are multi-byte character sets (MBCS): the single-byte character set (SBCS) portion of each CCSID
A Yen sign for JIS X 0201 compatibility. It includes several extensions, namely "NEC special characters (Row 13), NEC selection of IBM extensions (Rows 89 to 92), and IBM extensions (Rows 115 to 119)", in addition to setting some encoding space aside for end user definition. Windows codepage 932 is the version used in the W3C/WHATWG encoding standard used by HTML5, which includes
A different set of extended special characters, based on the NEC special characters, some of which were only available in the printer versions of the fonts. Older versions of Maru Gothic and Hon Mincho from System 7.1 encoded vertical presentation forms at 10 (not 84) JIS rows down from their canonical forms, and did not include the special character extensions; this was subsequently changed. The typical variant used with KanjiTalk version 6 placed
A double-byte JIS X 0208 sequence $j_1 j_2$, the transformation to the corresponding Shift JIS bytes $s_1 s_2$ is: The competing 8-bit format EUC-JP, which does not support single-byte halfwidth katakana, allows for a cleaner and more direct conversion to and from JIS X 0208 code points, as all high-bit-set bytes are parts of
A double-byte character and all codes from ASCII range represent single-byte characters. HTML written in Shift JIS can still be interpreted to some extent when incorrectly tagged as ASCII, and when the charset tag is at the top of the document itself, since the important start and end of HTML tags and fields (<, >, /, ", &, ;) are encoded as the same bytes as in ASCII, and those bytes do not appear in two-byte sequences. Shift JIS can be used in string literals in programming languages such as C, but
A few things must be taken into consideration. Firstly, the escape character 0x5C, normally the backslash, is the half-width yen sign (¥) in Shift JIS. If the programmer is aware of this, it would be possible to use printf("ハローワールド¥n"); (where ハローワールド is Hello, world and ¥n is an escape sequence), assuming the I/O system supports Shift JIS output. Secondly, the 0x5C byte will cause problems when it appears as second byte of
A given plane. The same set of characters can be represented by EUC-JIS-2004, the EUC-JP based counterpart. Some of the additions collide with popular Shift JIS extensions, including Windows codepage 932 which is used in web standards (see above). For example, compare plane 1 row 89 in JIS X 0213 (beginning 硃, 硎, 硏...) to row 89 in the JIS X 0208 variant defined in web standards (beginning 纊, 褜, 鍈...). In addition, some of
A two-byte character, because it will be interpreted as an escape sequence, which will corrupt the interpretation of the string unless it is followed by another 0x5C. Many different versions of Shift JIS exist. There are two areas for expansion: Firstly, JIS X 0208 does not fill the whole 94×94 space encoded for it in Shift JIS, therefore there is room for more characters here; these are really extensions to JIS X 0208 rather than to Shift JIS itself. Secondly, Shift JIS has more encoding space than
Is a special editor which encodes Shift JIS this way. The chart below gives the detailed meaning of each byte in a stream encoded in standard Shift JIS (conforming to JIS X 0208:1997). Some of the bytes which are not used for single-byte codes or initial bytes in JIS X 0208:1997 are used by certain extensions, resulting in the layout detailed in the chart below. Single-byte encoding SBCS, or single-byte character set,
Is based on character sets defined within JIS standards JIS X 0201:1997 (for the single-byte characters) and JIS X 0208:1997 (for the double-byte characters). As of November 2024, 0.1% of surveyed web pages used Shift JIS (actually decoded as its superset Windows-31J encoding), a decline from 1.3% in July 2014. Shift JIS is the third-most declared character encoding for Japanese websites, used by 2.1% of sites in
Is commonly contrasted against the terms DBCS (double-byte character set) and TBCS (triple-byte character set), as well as MBCS (multi-byte character set). The multi-byte character sets are used to accommodate languages with scripts that have large numbers of characters and symbols, predominantly Asian languages such as Chinese, Japanese, and Korean. These are sometimes referred to by the acronym CJK. In these computing systems, SBCSs are traditionally associated with half-width characters, so-called because such SBCS characters would traditionally occupy half
Is different. The double-byte character set (DBCS) portion is the same across each CCSID. CCSID 5028 uses an updated code page 897 called CCSID 4993. CCSID 932 uses the original code page 897, which is CCSID 897. CCSID 942 uses a different SBCS from the other two CCSIDs, which is 1041. Also notice how CCSID 5028 and 4993 are different by 4096 (1000 in hexadecimal) from the predecessor CCSID with
Is much scope for confusion, if the extensions are used. One variant must be used when encoding Shift JIS in source code strings of C and similar programming languages. This variant doubles the byte 0x5C if it appears as second byte of a two-byte character, but not if it appears as a single "¥" (ASCII: "\") character, because 0x5C is the beginning of an escape sequence. The best way of handling this
Is needed for JIS X 0201 and JIS X 0208 (see § Shift JIS byte map below), and this space can be, and is, used for yet more characters (as either single-byte or double-byte characters). The most popular extension is Windows code page 932 (a CCSID also used for IBM's extension to Shift JIS), which is registered with the IANA as "Windows-31J", separately from Shift JIS. This was popularized by Microsoft, although Microsoft itself does not recognize
Is the cell (点, ten, point) number (1-94). The ku and ten numbers are equivalent to $j_1 - 32$ and $j_2 - 32$ respectively, where $j_1 j_2$ is a two-byte JIS sequence referencing
Is used to refer to character encodings that use exactly one byte for each graphic character. An SBCS can accommodate a maximum of 256 symbols, and is useful for scripts that do not have many symbols or accented letters such as the Latin, Greek and Cyrillic scripts used mainly for European languages. Examples of SBCS encodings include ISO/IEC 646, the various ISO 8859 encodings, and the various Microsoft/IBM code pages. The term SBCS
The Shift_JIS range 0xEB41–0xED96, at 84 JIS rows down from their canonical forms, and 260 special characters in the Shift_JIS range 0x8540–0x886D. This variant was introduced in KanjiTalk version 7. However, certain Mac OS typefaces used other variants. Sai Mincho and Chu Gothic use a "PostScript" variant of MacJapanese, which included additional vertical presentation forms and
The overline here), but the Yen sign to 0x5C (as in JIS X 0201 and standard Shift JIS). It also extended JIS X 0201 by assigning the backslash to 0x80 (corresponding to 0x5C in US-ASCII), the non-breaking space to 0xA0, the copyright sign to 0xFD, the trademark symbol to 0xFE and the half-width horizontal ellipsis to 0xFF. It also added extended double-byte characters, including 53 vertical presentation forms in
The "formerly proprietary extensions from IBM and NEC" from Windows-31J in its table for JIS X 0208, and also treats the label "shift_jis" interchangeably with "windows-31j" with the intent of being "compatible with deployed content". The version of Shift-JIS originating from the classic Mac OS (known as x-mac-japanese, Code page 10001 or MacJapanese) assigned the tilde to 0x7E (following US-ASCII, not JIS X 0201 which assigns
The .jp domain, while UTF-8 is used by 98% of Japanese websites. Shift JIS is also sometimes used in QR codes (a Japanese invention that also allows UTF-8, which may nonetheless be the preferred choice). Shift JIS is an extension of the single-byte encoding JIS X 0201:1997 that uses unassigned code points in JIS X 0201 to encode the double-byte JIS X 0208:1997 character set. The lead bytes for
The Windows-31J name and instead calls that variation "shift_jis". IBM's code page 943 includes the same double-byte codes as Microsoft's code page 932, while IBM's code page 932 includes fewer extensions (excluding those which Microsoft incorporates from NEC), and retains the character order from the 1978 edition of JIS X 0208, rather than implementing the character variant swaps from the 1983 standard. Windows-31J assigns 0x5C to U+005C REVERSE SOLIDUS (the backslash), and 0x7E to U+007E TILDE, following US-ASCII. However, most localised fonts on Windows display U+005C as
The characters map to Unicode characters beyond the BMP. The space with lead bytes 0xF5 to 0xF9 (beyond the region used for JIS X 0208) is used by Japanese mobile phone operators for pictographs for use in E-mail. KDDI goes further and defines hundreds more in the space with lead bytes 0xF3 and 0xF4. Beyond even this, there have been numerous minor variations made on Shift JIS, with individual characters here and there altered. Most of these extensions and variants have no IANA registration, so there
The double-byte characters are "shifted" around the 64 halfwidth katakana characters in the single-byte range 0xA1 to 0xDF. The single-byte characters 0x00 to 0x7F match the ASCII encoding, except for a yen sign (U+00A5) at 0x5C and an overline (U+203E) at 0x7E in place of the ASCII character set's backslash and tilde respectively (these deviations from ASCII align with JIS X 0201). The single-byte characters from 0xA1 to 0xDF map to
The first byte of two-byte characters will be high-bit-set (0x80–0xFF); the value of the second byte can be either high or low. The appearance of byte values 0x40–0x7E as second bytes of code words makes reliable Shift JIS detection difficult, because the same codes are used for ASCII characters. Because the same byte value can be either a first or a second byte, string searches are also difficult: simple searches can match
The following method of mapping codepoints. In the above, $s_1 s_2$ is a two-byte Shift_JIS-2004 sequence, $m$ is the plane (面, men, surface) number (1 or 2), $k$ is the row (区, ku, ward) number (1-94) and $t$
The half-width katakana characters found in JIS X 0201. For double-byte characters, the first byte is always in the range 0x81 to 0x9F or the range 0xE0 to 0xEF (these ranges are unassigned in JIS X 0201). If the character lies in an odd-numbered JIS X 0208 row (that is, its first JIS byte is odd), the second byte must be in the range 0x40 to 0x9E (but cannot be 0x7F); if the row is even-numbered, the second byte must be in the range 0x9F to 0xFC. Shift JIS only guarantees that
The same code page identifier. This is a common way that CDRA denotes an upgraded CCSID. There are a few reasons for this complexity: Shift-JIS Shift JIS (also SJIS, MIME name Shift_JIS, known as PCK in Solaris contexts) is a character encoding for the Japanese language, originally developed by the Japanese company ASCII Corporation in conjunction with Microsoft and standardized as JIS X 0208 Appendix 1. Shift JIS
The second byte of a character and the first byte of the next, which is not a valid Shift JIS character. String-searching algorithms must be tailor-made for Shift JIS. Shift JIS is fully backwards compatible with the JIS X 0201 single-byte encoding, meaning that any valid JIS X 0201 string is also a valid Shift JIS string. Double-byte characters in JIS X 0208 need to be transformed in order to be encoded in Shift JIS. For
The vertical presentation forms 10 rows down, and also used the NEC extension layout for row 13. The newer JIS X 0213 standard defines an extended variant of Shift_JIS referred to as Shift_JISx0213 (in a previous version of the standard) or Shift_JIS-2004. It is a superset of standard Shift JIS. In order to represent the allocated rows on both planes of JIS X 0213, Shift_JIS-2004 uses
The width of a DBCS character on a fixed-width computer terminal or text screen. Though single-byte character sets have largely been supplanted by UTF-8 and its variants on modern systems, they have found a niche in code golfing, where the smaller byte size of characters allows participants to gain an edge if they use SBCSs with specially-designed programming languages such as Vyxal and GolfScript.
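The source-code variant of Shift JIS described above, in which a 0x5C byte is doubled when it is the second byte of a two-byte character but left alone when it stands as a single-byte character, can be sketched as follows. This is a minimal illustration, not a reference implementation: the function names are hypothetical, and the lead-byte test assumes the ranges 0x81–0x9F and 0xE0–0xFC so that common extensions are covered as well as standard Shift JIS.

```python
def is_lead_byte(b: int) -> bool:
    """Assumed lead-byte test: standard Shift JIS uses 0x81-0x9F and
    0xE0-0xEF; extensions use lead bytes up to 0xFC."""
    return 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xFC


def double_trail_backslashes(data: bytes) -> bytes:
    """Double every 0x5C that is the second byte of a two-byte character.

    A 0x5C that stands alone (the single-byte yen sign / backslash) is
    copied unchanged, so real escape sequences keep working.
    """
    out = bytearray()
    i = 0
    while i < len(data):
        b = data[i]
        if is_lead_byte(b) and i + 1 < len(data):
            trail = data[i + 1]
            out += bytes([b, trail])
            if trail == 0x5C:
                out.append(0x5C)  # protect the trail byte from the C lexer
            i += 2
        else:
            out.append(b)  # single-byte character (or a truncated lead byte)
            i += 1
    return bytes(out)


# The katakana "so" is encoded as 0x83 0x5C, so its trail byte is doubled,
# while the standalone 0x5C of the escape sequence that follows is not.
sample = "ソ".encode("shift_jis") + b"\x5cn"
escaped = double_trail_backslashes(sample)
```

Walking the byte stream from the start is what makes this safe: a 0x5C can only be classified as a trail byte once the preceding lead byte has been consumed.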
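The transformation of a two-byte JIS X 0208 sequence into Shift JIS bytes can be illustrated with a short sketch. The arithmetic below is the commonly cited form of the conversion and is consistent with the byte ranges given in the text (lead bytes 0x81–0x9F and 0xE0–0xEF; trail bytes 0x40–0x9E without 0x7F for odd rows, and 0x9F–0xFC for even rows). The function name `jis_to_sjis` is hypothetical, and the cross-check assumes that Python's `euc_jp` and `shift_jis` codecs agree on the sample characters.

```python
def jis_to_sjis(j1: int, j2: int) -> bytes:
    """Convert a JIS X 0208 byte pair (each 0x21-0x7E) to Shift JIS bytes."""
    if not (0x21 <= j1 <= 0x7E and 0x21 <= j2 <= 0x7E):
        raise ValueError("JIS X 0208 bytes must be in the range 0x21-0x7E")

    # Two JIS rows share one Shift JIS lead byte; rows 1-62 land in
    # 0x81-0x9F and rows 63-94 land in 0xE0-0xEF.
    s1 = (j1 + 1) // 2 + (112 if j1 <= 94 else 176)

    if j1 % 2 == 1:
        # Odd row: trail byte 0x40-0x9E, skipping 0x7F.
        s2 = j2 + 31 + j2 // 96
    else:
        # Even row: trail byte 0x9F-0xFC.
        s2 = j2 + 126
    return bytes([s1, s2])


def jis_bytes_of(ch: str) -> tuple[int, int]:
    """Recover the JIS X 0208 bytes of a character from its EUC-JP form,
    which is the JIS byte pair with the high bit set on both bytes."""
    euc = ch.encode("euc_jp")
    return euc[0] & 0x7F, euc[1] & 0x7F


# Cross-check a few characters against Python's own Shift JIS codec.
for ch in "あ漢ソ":
    assert jis_to_sjis(*jis_bytes_of(ch)) == ch.encode("shift_jis")
```

The EUC-JP step also shows why that encoding is described as the more direct one: recovering the JIS bytes only requires clearing the high bit, whereas Shift JIS needs the row-parity arithmetic above.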