Code page 932 (Microsoft Windows)

Microsoft Windows code page 932 (abbreviated MS932, Windows-932 or, ambiguously, CP932), also called Windows-31J amongst other names (see § Terminology below), is the Microsoft Windows code page for the Japanese language. It is an extended variant of the Shift JIS Japanese character encoding. It contains standard 7-bit ASCII codes, and Japanese characters are indicated by the high bit of the first byte being set to 1. Some code points in this page require a second byte, so each character is encoded in either 8 or 16 bits.
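
The single-byte and double-byte behaviour described above can be observed with a short sketch. It assumes Python's built-in "cp932" codec as a stand-in for Windows code page 932; the helper names are hypothetical and chosen only for illustration.

```python
# Minimal sketch, assuming Python's bundled "cp932" codec approximates
# Windows code page 932. Helper names are illustrative, not from any API.

def cp932_lengths(text: str) -> list[tuple[str, int]]:
    """Return (character, encoded byte count) pairs for each character."""
    return [(ch, len(ch.encode("cp932"))) for ch in text]

def high_bit_set(ch: str) -> bool:
    """True if the first encoded byte has its high bit set to 1."""
    return (ch.encode("cp932")[0] & 0x80) != 0

# ASCII letter, halfwidth katakana, hiragana, kanji
sample = "A\uff71\u3042\u6f22"
lengths = cp932_lengths(sample)
flags = [high_bit_set(ch) for ch in sample]

for (ch, n), flag in zip(lengths, flags):
    print(f"U+{ord(ch):04X}: {n} byte(s), high bit set: {flag}")
```

In this sketch the ASCII letter occupies one byte with the high bit clear, while every Japanese character starts with a byte whose high bit is set, whether it takes one byte (halfwidth katakana) or two.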

49-515: IBM offer the same extended double-byte codes in their code page 943 (IBM-943 or CP943), which is a combination of the single-byte Code page 897 and the double-byte Code page 941. Windows-31J is the most used non-UTF-8/Unicode Japanese encoding on the web. However, many people and software packages, including Microsoft libraries, declare the Shift JIS encoding for Windows-31J data, although it includes some additional characters, and some of

98-755: A ROM chip that contained the font. The interface of those adapters (emulated by all later adapters such as VGA) was typically limited to single-byte character sets with only 256 characters in each font/encoding (although VGA added partial support for slightly larger character sets). When dealing with older hardware, protocols and file formats, it is often necessary to support these code pages, but newer encoding systems, in particular Unicode, are encouraged for new designs. DOS code pages are typically stored in .CPI files. These code pages are used by IBM in its AIX operating system. They emulate several character sets, namely those designed to be used in accordance with ISO standards, such as on UNIX-like operating systems. Code page 819

147-719: A Shift JIS variant, lacks the NEC and NEC-selected double-byte vendor extensions which are present in Microsoft's variant (although both include the IBM extensions) and preserves the 1978 ordering of JIS X 0208. IBM's code page 943 (or "IBM-943") includes the same double byte codes as Windows code page 932. Microsoft's version corresponds closely to the encoding referred to as ibm-943_P15A-2003 (with aliases including CP943C and Windows-932 ) in International Components for Unicode (ICU). There

196-647: A concern for Unicode. UTF-8 (which can encode over one million codepoints) has replaced the code-page method in terms of popularity on the Internet. When, early in the history of personal computers, users did not find their character encoding requirements met, private or local code pages were created using terminate-and-stay-resident utilities or by re-programming BIOS EPROMs . In some cases, unofficial code page numbers were invented (e.g. CP895). When more diverse character set support became available most of those code pages fell into disuse, with some exceptions such as

245-601: A few C0 control characters. IBM-943, like IBM-932, is a superset of the single-byte Code page 897, which maps 0x5C to the Yen symbol (¥) and 0x7E to the overline (‾); this is followed by the encoding named "ibm-943_P130-1999" in ICU. Code page 897 (and therefore also IBM-943 and IBM-932) also adds single-byte box-drawing characters replacing certain C0 control characters; however, these may still be treated as control characters depending on

294-503: A graphics mode and bypass this hardware limitation entirely. However the system of referring to character encodings by a code page number remains applicable, as an efficient alternative to string identifiers such as those specified by the IETF and IANA for use in various protocols such as e-mail and web pages. The majority of code pages in current use are supersets of ASCII , a 7-bit code representing 128 control codes and printable characters. In

343-453: A long time. Vendors that use a code page system allocate their own code page number to a character encoding, even if it is better known by another name; for example, UTF-8 has been assigned page numbers 1208 at IBM, 65001 at Microsoft, and 4110 at SAP. Hewlett-Packard uses a similar concept in its HP-UX operating system and its Printer Command Language (PCL) protocol for printers (either for HP printers or not). The terminology, however,

392-546: A program that does not use Unicode, the code page used for each string/document needs to be stored. Applications may also mislabel text in Windows-1252 as ISO-8859-1. The only difference between these code pages is that the code point values in the range 0x80–0x9F, used by ISO-8859-1 for control characters, are instead used as additional printable characters in Windows-1252, notably for quotation marks,
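
The 0x80–0x9F difference between the two code pages can be illustrated with a brief sketch. It assumes Python's standard "cp1252" and "latin-1" codecs; the byte values are picked only as examples of the affected range.

```python
# Minimal sketch using Python's standard "cp1252" and "latin-1" codecs.
import unicodedata

# Four bytes from the 0x80-0x9F range that differs between the code pages.
raw = bytes([0x93, 0x80, 0x99, 0x94])

as_cp1252 = raw.decode("cp1252")    # printable characters
as_latin1 = raw.decode("latin-1")   # C1 control characters

# Unicode general category "Cc" marks control characters.
latin1_categories = [unicodedata.category(ch) for ch in as_latin1]

print([f"U+{ord(ch):04X}" for ch in as_cp1252])
print(latin1_categories)
```

Decoding the same bytes with the wrong label therefore yields invisible control characters instead of quotation marks, the euro sign and the trademark symbol.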

441-461: A series of Symbol Sets (each with its associated Symbol Set Code) to encode either its own character sets or other vendors’ character sets. They are normally 7-bit character sets which, when moved to the higher part and associated with the ASCII character set, make up 8-bit character sets. These code pages are independent assignments by third party vendors. Since the original IBM PC code page ( number 437 )

490-412: A small, but globally unique, 16 bit number to each character encoding that a computer system or collection of computer systems might encounter. The IBM origin of the numbering scheme is reflected in the fact that the smallest (first) numbers are assigned to variations of IBM's EBCDIC encoding and slightly larger numbers refer to variations of IBM's extended ASCII encoding as used in its PC hardware. With

539-464: Is a superset of JIS X 0208 containing kanji sets level 1 to 3 and non-kanji characters such as Hiragana, Katakana (including letters used to write the Ainu language), Latin, Greek and Cyrillic alphabets, digits, symbols and so on. Plane 2 contains only the level 4 kanji set. The total number of defined characters is 11,233. Each character is capable of being encoded in two bytes. This standard largely replaced

588-509: Is also a second ICU encoding named ibm-943_P130-1999 , which uses different single-byte mappings which more closely match IBM's code page definitions. (See § Single-byte character differences below for details.) Windows code page 932 is registered with the IANA as Windows-31J . The "Windows-31J" label is IANA's and not recognized by Microsoft, which has historically used "shift_jis" instead. The W3C / WHATWG encoding standard used by HTML5 treats

637-419: Is built around using an 8-bit code page, though it is possible to use two at once with some color depth sacrifice, and up to eight may be stored in the display adapter for easy switching. There was a selection of third-party code page fonts that could be loaded into such hardware. However, it is now commonplace for operating system vendors to provide their own character encoding and rendering systems that run in

686-438: Is different: What others call a character set , HP calls a symbol set , and what IBM or Microsoft call a code page , HP calls a symbol set code . HP developed a series of symbol sets, each with an associated symbol set code, to encode both its own character sets and other vendors’ character sets. The multitude of character sets leads many vendors to recommend Unicode . IBM introduced the concept of systematically assigning

735-617: Is identical to Latin-1, ISO/IEC 8859-1 , and with slightly-modified commands, permits MS-DOS machines to use that encoding. It was used with IBM AS/400 minicomputers. These code pages are used by IBM in its OS/2 operating system. These code pages are used by IBM when emulating the Microsoft Windows character sets. Most of these code pages have the same number as Microsoft code pages, although they are not exactly identical. Some code pages, though, are new from IBM, not devised by Microsoft. These code pages are used by IBM when emulating

784-486: Is not compatible with Windows-31J. In addition to the above, Microsoft uses different (but visually similar) Unicode mapping for several double-byte punctuation characters compared to standard Shift JIS, such as the wave dash being mapped to U+FF5E rather than U+301C, which is followed by ibm-943_P15A-2003 but not ibm-943_P130-1999, and using different mapping for the double byte backslash. Windows-932 includes standard 7-bit ASCII mappings for single-byte sequences with

833-451: Is officially reserved for user-definable code pages (or actually CCSIDs in the context of IBM CDRA), whereas the range 65280-65533 (FF00h-FFFDh) is reserved for any user-definable "private use" assignments. For example, a non-registered custom variant of code page 437 (1B5h) or 28591 (6FAFh) could become 57781 (E1B5h) or 61359 (EFAFh), respectively, in order to avoid potential conflicts with other assignments and maintain
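
The arithmetic behind the two examples above can be sketched as follows. The rule used here (keep the low 12 bits and set the top hexadecimal digit to Eh) is an assumption inferred from those two examples only, and the helper names are hypothetical.

```python
# Minimal sketch. The ranges are those quoted in the text; the remapping
# rule is an assumption inferred from the 437 and 28591 examples.

USER_DEFINABLE = range(0xE000, 0xEFFF + 1)   # 57344-61439
PRIVATE_USE = range(0xFF00, 0xFFFD + 1)      # 65280-65533

def to_user_definable(code_page: int) -> int:
    """Hypothetical helper: keep the low 12 bits, set the top digit to Eh."""
    return 0xE000 | (code_page & 0x0FFF)

def classify(code_page: int) -> str:
    """Name the reserved range, if any, that a code page ID falls into."""
    if code_page in USER_DEFINABLE:
        return "user-definable"
    if code_page in PRIVATE_USE:
        return "private-use"
    return "other"

for original in (437, 28591):
    custom = to_user_definable(original)
    print(f"{original} ({original:X}h) -> {custom} ({custom:X}h), {classify(custom)}")
```

Note that this rule discards the original top digit, so 28591 (6FAFh) and a hypothetical FAFh would collide; it illustrates the numbering idea rather than a reversible scheme.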

882-508: Is one-way best-fit mapped onto 0x5C in Windows-932. However, code 0x5C in Windows-932 behaves as a reverse solidus (backslash) in all respects (e.g. in file paths on Windows systems) other than how it is displayed by some fonts, and Microsoft's documentation for Windows-932 displays 0x5C as a backslash. This mapping corresponds to the encoding named "ibm-943_P15A-2003" in International Components for Unicode (ICU), except for minor reordering of

931-531: The IBM Character Data Representation Architecture level 2 specifically reserves ranges of code page IDs for user-definable and private-use assignments. Whenever such code page IDs are used, the user must not assume that the same functionality and appearance can be reproduced in another system configuration or on another device or system unless the user takes care of this specifically. The code page range 57344-61439 ( E000h - EFFFh )

980-703: The Kamenický or KEYBCS2 encoding for the Czech and Slovak alphabets. Another character set is the Iran System encoding standard, which was created by the Iran System corporation for Persian language support. This standard was in use in Iran in DOS-based programs, and it became obsolete after the introduction of Microsoft code page 1256. However, some Windows and DOS programs using this encoding are still in use, and some Windows fonts with this encoding exist. In order to overcome such problems,

1029-462: The euro sign and the trademark symbol among others. Browsers on non-Windows platforms would tend to show empty boxes or question marks for these characters, making the text hard to read. Most browsers fixed this by ignoring the character set and interpreting as Windows-1252 to look acceptable. In HTML5, treating ISO-8859-1 as Windows-1252 is even codified as a W3C standard. Although browsers were typically programmed to deal with this behaviour, this

1078-448: The "OEM" and "Windows" code page for the applicable locale. These code pages are used by Microsoft in its MS-DOS operating system. Microsoft refers to these as the OEM code pages because they were defined by the original equipment manufacturers who licensed MS-DOS for distribution with their hardware, not by Microsoft or a standards organization. Most of these code pages have the same number as

1127-661: The Apple Macintosh character sets. These code pages are used by IBM when emulating the Adobe character sets. These code pages are used by IBM when emulating the HP character sets. These code pages are used by IBM when emulating the DEC character sets. These code pages are used by Microsoft in its own Windows operating system. Microsoft defined a number of code pages known as the ANSI code pages (as

1176-731: The IBM repertoire, but in a separate extension within the 94×94 JIS X 0208 grid (in rows 89–92, besides the characters already included in NEC row 13 ), rather than using Shift JIS codes beyond the JIS X 0208 range; Windows code page 932 includes these 388 characters in both locations. As a result, the because and not signs are encoded three times. Some of these representations were subsequently used for different characters by JIS X 0213 and Shift JIS-2004 . For example, compare row 89 in JIS X 0213 (beginning 硃, 硎, 硏…) to row 89 as used by JIS X 0208 with IBM/NEC extensions (beginning 纊, 褜, 鍈…). Consequently, Shift JIS-2004
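
The triple encoding of the because and not signs can be sketched as below. The specific byte sequences are assumptions recalled from Microsoft's published code page 932 mapping, not taken from the text, and Python's "cp932" codec is assumed to decode all three locations.

```python
# Minimal sketch, assuming Python's "cp932" codec. The byte sequences are
# assumptions: JIS X 0208 (1983) location, NEC location, IBM extension.

BECAUSE_CODES = [b"\x81\xe6", b"\x87\x9a", b"\xfa\x5b"]
NOT_SIGN_CODES = [b"\x81\xca", b"\xee\xf9", b"\xfa\x54"]

# Decode each location and collect the distinct results.
because_chars = {code.decode("cp932") for code in BECAUSE_CODES}
not_chars = {code.decode("cp932") for code in NOT_SIGN_CODES}

print([f"U+{ord(ch):04X}" for ch in because_chars])
print([f"U+{ord(ch):04X}" for ch in not_chars])
```

If the assumed byte values are right, each set collapses to a single character, which means a decode followed by an encode cannot preserve which of the three byte sequences was originally used.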

1225-464: The NEC extensions or NEC selection. The IBM extensions were designed to encode characters from the IBM Japanese DBCS-Host repertoire which were initially absent in JIS X 0208; the 'because' sign ∵ and 'not' sign ￢ were later added to JIS X 0208 itself in 1983, and Microsoft includes them at extension locations as well as their 1983 locations. The NEC extensions also encode the entirety of

1274-413: The backslash by mapping the double byte 0x815F to U+FF3C FULLWIDTH REVERSE SOLIDUS, whereas standard Shift JIS maps it to U+005C. However, 0x5C in Windows-932 is nonetheless considered a Yen sign in certain contexts. For this reason, in many Japanese fonts, U+005C is displayed as a Yen symbol, which would normally be represented as U+00A5, rather than as a backslash per Unicode's suggested rendering. U+00A5

1323-575: The context, and are mapped to control characters in ICU. Code page In computing , a code page is a character encoding and as such it is a specific association of a set of printable characters and control characters with unique numbers. Typically each number represents the binary value in a single byte. (In some contexts these terms are used more precisely; see Character encoding § Terminology .) The term "code page" originated from IBM 's EBCDIC -based mainframe systems, but Microsoft , SAP , and Oracle Corporation are among

1372-530: The correct decoding algorithm when encountering binary stored data. These code pages are used by IBM in its EBCDIC character sets for mainframe computers . These code pages are used by IBM in its PC DOS operating system. These code pages were originally embedded directly in the text mode hardware of the graphic adapters used with the IBM PC and its clones, including the original MDA and CGA adapters whose character sets could only be changed by physically replacing

1421-413: The distant past, 8-bit implementations of the ASCII code set the top bit to zero or used it as a parity bit in network data transmissions. When the top bit was made available for representing character data, a total of 256 characters and control codes could be represented. Most vendors (including IBM) used this extended range to encode characters used by various languages and graphical elements that allowed

1470-482: The distinction is significant for computer programmers wishing to avoid mojibake . In addition to the standard JIS X 0201 :1997 and JIS X 0208 :1997 characters, Windows-31J includes several JIS X 0208 extensions, namely " NEC special characters (Row 13), NEC selection of IBM extensions (Rows 89 to 92), and IBM extensions (Rows 115 to 119)", in addition to setting some encoding space aside for end user definition . This also differs from IBM-932 , which does not include

1519-437: The equivalent IBM code pages, although some are not exactly identical. These code pages are used by Microsoft when emulating the Apple Macintosh character sets. The following code page numbers are specific to Microsoft Windows. IBM may use different numbers for these code pages. They emulate several character sets, namely those designed to be used in accordance with ISO standards, such as on UNIX-like operating systems. HP developed

1568-503: The existing characters are mapped to Unicode differently. This has led the WHATWG HTML standard to treat the encoding labels shift_jis and windows-31j interchangeably, and use the Windows variant for its "Shift_JIS" encoder and decoder. Microsoft's Shift JIS variant is known simply as "Code page 932" on Microsoft Windows; however, this is ambiguous, as IBM's code page 932, while also

1617-657: The first one, 1252 was based on an apocryphal ANSI draft of what became ISO 8859-1 ). Code page 1252 is built on ISO 8859-1 but uses the range 0x80-0x9F for extra printable characters rather than the C1 control codes from ISO 6429 mentioned by ISO 8859-1. Some of the others are based in part on other parts of ISO 8859 but often rearranged to make them closer to 1252. Microsoft recommends new applications use UTF-8 or UCS-2/UTF-16 instead of these code pages. These code pages represent DBCS character encodings for various CJK languages. In Microsoft operating systems, these are used as both

1666-573: The high bit set to 0. Hence, codes 0x5C and 0x7E are mapped to Unicode as U+005C REVERSE SOLIDUS ( \ , the backslash ) and U+007E TILDE ( ~ ) respectively, as they are in ASCII ( ISO-646 -US). This is likewise done by the W3C/WHATWG encoding standard. By contrast, 0x5C is mapped to U+00A5 YEN SIGN ( ¥ ) in ISO-646-JP and consequently JIS X 0201 , of which standard Shift JIS is an extension. Correspondingly, Windows-31J avoids duplicate encoding of
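
The single-byte mappings described above can be observed with a short sketch, assuming Python's "cp932" codec as a stand-in for Windows-932. The double-byte code 0x815F is included for comparison, and the variable names are illustrative.

```python
# Minimal sketch, assuming Python's "cp932" codec approximates Windows-932.

# Single bytes 0x5C and 0x7E decode as in ASCII.
single_byte = bytes([0x5C, 0x7E]).decode("cp932")

# The double-byte backslash decodes to a separate fullwidth character,
# so U+005C is not encoded twice.
double_byte = b"\x81\x5f".decode("cp932")

print([f"U+{ord(ch):04X}" for ch in single_byte])
print(f"U+{ord(double_byte):04X}")
```

Whether 0x5C is then drawn as a backslash or as a Yen sign depends on the font, not on the decoded code point.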

1715-433: The imitation of primitive graphics on text-only output devices. No formal standard existed for these "extended ASCII character sets" and vendors referred to the variants as code pages, as IBM had always done for variants of EBCDIC encodings. Unicode is an effort to include all characters from all currently and historically used human languages into single character enumeration (effectively one large single code page), removing

1764-650: The installed code pages on any given Windows machine can be found in the Registry on that machine (this information is used by Microsoft programs such as Internet Explorer ). Most well-known code pages, excluding those for the CJK languages and Vietnamese , fit all their code-points into eight bits and do not involve anything more than mapping each code-point to a single character; furthermore, techniques such as combining characters, complex scripts, etc., are not involved. The text mode of standard ( VGA-compatible ) PC graphics hardware

1813-522: The label Shift_JIS (or sjis ) for JIS X 0208-defined Shift JIS, without recognising the Windows-31J label. In Japanese editions of Windows, this code page is referred to as "ANSI" , since it is the operating system's default 8-bit encoding, even though ANSI was not involved in its definition. Windows-31J is often mistaken for standard Shift JIS (as defined in JIS X 0208 :1997 Appendix 1): while similar,

1862-422: The label "shift_jis" interchangeably with "windows-31j" with the intent of being "compatible with deployed content" and matches Windows code page 932 (including the "formerly proprietary extensions from IBM and NEC"). Windows code page 932 is also called MS_Kanji, although IANA treat MS_Kanji as an alias for standard Shift JIS. Python, for example, uses the label MS-Kanji (or cp932) for Windows-932 and

1911-502: The list of assigned code page numbers independently from each other, resulting in some conflicting assignments. At least one third-party vendor ( Oracle ) also has its own different list of numeric assignments. IBM's current assignments are listed in their CCSID repository, while Microsoft's assignments are documented within the MSDN . Additionally, a list of the names and approximate IANA ( Internet Assigned Numbers Authority ) abbreviations for

1960-490: The meaning of all code point values in their code pages, which decreases the reliability of handling textual data consistently through various computer systems. Some vendors add proprietary extensions to established code pages, to add or change certain code point values: for example, byte 0x5C in Shift JIS can represent either a back slash or a yen sign depending on the platform. Finally, in order to support several languages in

2009-562: The need to distinguish between different code pages when handling digitally stored text. Unicode tries to retain backwards compatibility with many legacy code pages, copying some code pages 1:1 in the design process. An explicit design goal of Unicode was to allow round-trip conversion between all common legacy code pages, although this goal has not always been achieved. Some vendors, namely IBM and Microsoft, have anachronistically assigned code page numbers to Unicode encodings. This convention allows code page numbers to be used as metadata to identify

2058-466: The private range like 65280 ( FF00h ). The code page IDs 0, 65534 ( FFFEh ) and 65535 ( FFFFh ) are reserved for internal use by operating systems such as DOS and must not be assigned to any specific code pages. JIS X 0213 JIS X 0213 is a Japanese Industrial Standard defining coded character sets for encoding the characters used in Japan. This standard extends JIS X 0208 . The first version

2107-498: The rarely used JIS X 0212 -1990 "supplementary" standard, which included 5,801 kanji and 266 non-kanji. Of the additional 3,695 kanji in JIS X 0213, all but 952 were already in JIS X 0212 . JIS X 0213 defines several 7-bit and 8-bit encodings including EUC-JIS-2004 , ISO-2022-JP-2004 and Shift JIS-2004 . Also, it defines the mapping from each of these encodings to ISO/IEC 10646 ( Unicode ) for each character. Unicode version 3.2 incorporated all characters of JIS X 0213 except for

2156-451: The release of PC DOS version 3.3 (and the near identical MS-DOS 3.3) IBM introduced the code page numbering system to regular PC users, as the code page numbers (and the phrase "code page") were used in new commands to allow the character encoding used by all parts of the OS to be set in a systematic way. After IBM and Microsoft ceased to cooperate in the 1990s, the two companies have maintained

2205-421: The sometimes existing internal numerical logic in the assignments of the original code pages. An unregistered private code page not based on an existing code page, a device specific code page like a printer font, which just needs a logical handle to become addressable for the system, a frequently changing download font, or a code page number with a symbolic meaning in the local environment could have an assignment in

2254-448: The vendors that use this term. The majority of vendors identify their own character sets by a name. In the case when there is a plethora of character sets (like in IBM), identifying character sets through a number is a convenient way to distinguish them. Originally, the code page numbers referred to the page numbers in the IBM standard character set manual, a condition which has not held for

2303-421: Was not always true of other software. Consequently, when receiving a file transfer from a Windows system, non-Windows platforms would either ignore these characters or treat them as standard control characters and attempt to take the specified control action accordingly. Due to Unicode's extensive documentation, vast repertoire of characters and stability policy of characters, the problems listed above are rarely

2352-687: Was not really designed for international use, several partially compatible country- or region-specific variants emerged. The number assignments of these code pages are official neither at IBM nor at Microsoft, and almost none of them is listed as a usable character set by IANA. The numbers assigned to these code pages are arbitrary and may clash with registered numbers in use by IBM or Microsoft. Some of them may predate codepage switching being added in DOS 3.3. List of known code page assignments (incomplete): Many older character encodings (unlike Unicode) suffer from several problems. Some vendors insufficiently document

2401-501: Was published in 2000 and revised in 2004 (JIS2004) and 2012. As well as adding a number of special characters, characters with diacritic marks, etc., it included an additional 3,625 kanji. The full name of the standard is 7-bit and 8-bit double byte coded extended KANJI sets for information interchange (7ビット及び8ビットの2バイト情報交換用符号化拡張漢字集合, Nana-Bitto Oyobi Hachi-Bitto no Ni-Baito Jōhō Kōkan'yō Fugōka Kakuchō Kanji Shūgō). JIS X 0213 has two "planes" (94×94 character tables). Plane 1
