Misplaced Pages

Double-precision floating-point format

Article snapshot taken from Wikipedia with creative commons attribution-sharealike license. Give it a read and then ask your questions in the chat. We can research this topic together.

Double-precision floating-point format (sometimes called FP64 or float64 ) is a floating-point number format , usually occupying 64 bits in computer memory; it represents a wide range of numeric values by using a floating radix point .

#303696

55-468: Double precision may be chosen when the range or precision of single precision would be insufficient. In the IEEE 754 standard , the 64-bit base-2 format is officially referred to as binary64 ; it was called double in IEEE 754-1985 . IEEE 754 specifies additional floating-point formats, including 32-bit base-2 single precision and, more recently, base-10 representations ( decimal floating point ). One of

110-488: A biased exponent of 0, but are interpreted with the value of the smallest allowed exponent, which is one greater (i.e., as if it were encoded as a 1). In decimal interchange formats they require no special encoding because the format supports unnormalized numbers directly. Mathematically speaking, the normalized floating-point numbers of a given sign are roughly logarithmically spaced, and as such any finite-sized normal float cannot include zero . The subnormal floats are

165-453: A 52-bit fraction is or Between 2=4,503,599,627,370,496 and 2=9,007,199,254,740,992 the representable numbers are exactly the integers. For the next range, from 2 to 2, everything is multiplied by 2, so the representable numbers are the even ones, etc. Conversely, for the previous range from 2 to 2, the spacing is 0.5, etc. The spacing as a fraction of the numbers in the range from 2 to 2 is 2. The maximum relative rounding error when rounding

220-436: A decimal string with the same number of digits, the final result should match the original string. If an IEEE 754 double-precision number is converted to a decimal string with at least 17 significant digits, and then converted back to double-precision representation, the final result must match the original number. The format is written with the significand having an implicit integer bit of value 1 (except for special data, see

275-477: A finite digit binary fraction. For example, decimal 0.1 cannot be represented in binary exactly, only approximated. Therefore: Since IEEE 754 binary32 format requires real values to be represented in ( 1. x 1 x 2 . . . x 23 ) 2 × 2 e {\displaystyle (1.x_{1}x_{2}...x_{23})_{2}\times 2^{e}} format (see Normalized number , Denormalized number ), 1100.011

330-473: A given 32-bit binary32 data with a given sign , biased exponent e (the 8-bit unsigned integer), and a 23-bit fraction is which yields In this example: thus: Note: The single-precision binary floating-point exponent is encoded using an offset-binary representation, with the zero offset being 127; also known as exponent bias in the IEEE 754 standard. Thus, in order to get the true exponent as defined by

385-405: A linearly spaced set of values, which span the gap between the negative and positive normal floats. Subnormal numbers provide the guarantee that addition and subtraction of floating-point numbers never underflows; two nearby floating-point numbers always have a representable non-zero difference. Without gradual underflow, the subtraction a  −  b can underflow and produce zero even though

440-517: A maximum value of (2 − 2 ) × 2 ≈ 3.4028235 × 10 . All integers with seven or fewer decimal digits, and any 2 for a whole number −149 ≤ n ≤ 127, can be converted exactly into an IEEE 754 single-precision floating-point value. In the IEEE 754 standard , the 32-bit base-2 format is officially referred to as binary32 ; it was called single in IEEE 754-1985 . IEEE 754 specifies additional floating-point types, such as 64-bit base-2 double precision and, more recently, base-10 representations. One of

495-564: A modifier strictfp was introduced to enforce strict IEEE 754 computations. Strict floating point has been restored in Java ;17. As specified by the ECMAScript standard, all arithmetic in JavaScript shall be done using double-precision floating-point arithmetic. The JSON data encoding format supports numeric values, and the grammar to which numeric expressions must conform has no limits on

550-438: A number to the nearest representable one (the machine epsilon ) is therefore 2. The 11 bit width of the exponent allows the representation of numbers between 10 and 10, with full 15–17 decimal digits precision. By compromising precision, the subnormal representation allows even smaller values up to about 5 × 10. The double-precision binary floating-point exponent is encoded using an offset-binary representation, with

605-439: A practical implementation. Some implementations of floating-point units do not directly support subnormal numbers in hardware, but rather trap to some kind of software support. While this may be transparent to the user, it can result in calculations that produce or consume subnormal numbers being much slower than similar calculations on normal numbers. In IEEE binary floating point formats , subnormals are represented by having

SECTION 10

#1733202198304

660-408: A signal so quiet that it is out of the human hearing range. Because of this, a common measure to avoid subnormals on processors where there would be a performance penalty is to cut the signal to zero once it reaches subnormal levels or mix in an extremely quiet noise signal. Other methods of preventing subnormal numbers include adding a DC offset, quantizing numbers, adding a Nyquist signal, etc. Since

715-503: A significand with a leading digit of zero. Of these, the subnormal numbers represent values which if normalized would have exponents below the smallest representable exponent (the exponent having a limited range). The significand (or mantissa) of an IEEE floating-point number is the part of a floating-point number that represents the significant digits . For a positive normalised number it can be represented as m 0 . m 1 m 2 m 3 ... m p −2 m p −1 (where m represents

770-408: A significant decrease in performance. When subnormal values are entirely computed in hardware, implementation techniques exist to allow their processing at speeds comparable to normal numbers. However, the speed of computation remains significantly reduced on many modern x86 processors; in extreme cases, instructions involving subnormal operands may take as many as 100 additional clock cycles, causing

825-472: A significant digit, and p is the precision) with non-zero m 0 . Notice that for a binary radix , the leading binary digit is always 1. In a subnormal number, since the exponent is the least that it can be, zero is the leading significant digit (0. m 1 m 2 m 3 ... m p −2 m p −1 ), allowing the representation of numbers closer to zero than the smallest normal number. A floating-point number may be recognized as subnormal whenever its exponent

880-493: A value of 0.375. We saw that 0.375 = ( 0.011 ) 2 = ( 1.1 ) 2 × 2 − 2 {\displaystyle 0.375={(0.011)_{2}}={(1.1)_{2}}\times 2^{-2}} Hence after determining a representation of 0.375 as ( 1.1 ) 2 × 2 − 2 {\displaystyle {(1.1)_{2}}\times 2^{-2}} we can proceed as above: From these we can form

935-421: A value of 127 represents the actual exponent zero. Exponents range from −126 to +127 (thus 1 to 254 in the exponent field), because the biased exponent values 0 (all 0s) and 255 (all 1s) are reserved for special numbers ( subnormal numbers , signed zeros , infinities , and NaNs ). The true significand of normal numbers includes 23 fraction bits to the right of the binary point and an implicit leading bit (to

990-499: A zero exponent field with a non-zero significand field. No other denormalized numbers exist in the IEEE binary floating point formats, but they do exist in some other formats, including the IEEE decimal floating point formats. Some systems handle subnormal values in hardware, in the same way as normal values. Others leave the handling of subnormal values to system software ("assist"), only handling normal values and zero in hardware. Handling subnormal values in software always leads to

1045-438: Is 2 − 149 ≈ 1.4 × 10 − 45 {\displaystyle 2^{-149}\approx 1.4\times 10^{-45}} . In general, refer to the IEEE 754 standard itself for the strict conversion (including the rounding behaviour) of a real number into its equivalent binary32 format. Here we can show how to convert a base-10 real number into an IEEE 754 binary32 format using

1100-463: Is a computer number format , usually occupying 32 bits in computer memory ; it represents a wide dynamic range of numeric values by using a floating radix point . A floating-point variable can represent a wider range of numbers than a fixed-point variable of the same bit width at the cost of precision. A signed 32-bit integer variable has a maximum value of 2 − 1 = 2,147,483,647, whereas an IEEE 754 32-bit base-2 floating-point variable has

1155-467: Is a commonly used format on PCs, due to its wider range over single-precision floating point, in spite of its performance and bandwidth cost. It is commonly known simply as double . The IEEE 754 standard specifies a binary64 as having: The sign bit determines the sign of the number (including when this number is zero, which is signed ). The exponent field is an 11-bit unsigned integer from 0 to 2047, in biased form : an exponent value of 1023 represents

SECTION 20

#1733202198304

1210-499: Is analogous): The default denormalization behavior is mandated by the ABI , and therefore well-behaved software should save and restore the denormalization mode before returning to the caller or calling code in other libraries. AArch32 NEON (SIMD) FPU always uses a flush-to-zero mode , which is the same as FTZ + DAZ . For the scalar FPU and in the AArch64 SIMD, the flush-to-zero behavior

1265-423: Is defined as normal , and others are defined as subnormal , denormal , or unnormal by their relationship to normal . In a normal floating-point value, there are no leading zeros in the significand (also commonly called mantissa); rather, leading zeros are removed by adjusting the exponent (for example, the number 0.0123 would be written as 1.23 × 10 ). Conversely, a denormalized floating point value has

1320-433: Is known to work on Mac OS X since at least 2006. For other x86-SSE platforms where the C library has not yet implemented this flag, the following may work: The _MM_SET_DENORMALS_ZERO_MODE and _MM_SET_FLUSH_ZERO_MODE macros wrap a more readable interface for the code above. Most compilers will already provide the previous macro by default, otherwise the following code snippet can be used (the definition for FTZ

1375-495: Is more than 1/2 of a unit in the last place . Subnormal number In computer science , subnormal numbers are the subset of denormalized numbers (sometimes called denormals ) that fill the underflow gap around zero in floating-point arithmetic . Any non-zero number with magnitude smaller than the smallest positive normal number is subnormal , while denormal can also refer to numbers outside that range. In some older documents (especially standards documents such as

1430-438: Is shifted to the right by 3 digits to become ( 1.100011 ) 2 × 2 3 {\displaystyle (1.100011)_{2}\times 2^{3}} Finally we can see that: ( 12.375 ) 10 = ( 1.100011 ) 2 × 2 3 {\displaystyle (12.375)_{10}=(1.100011)_{2}\times 2^{3}} From which we deduce: From these we can form

1485-682: Is termed REAL in Fortran ; SINGLE-FLOAT in Common Lisp ; float in C , C++ , C# and Java ; Float in Haskell and Swift ; and Single in Object Pascal ( Delphi ), Visual Basic , and MATLAB . However, float in Python , Ruby , PHP , and OCaml and single in versions of Octave before 3.2 refer to double-precision numbers. In most implementations of PostScript , and some embedded systems ,

1540-595: Is the least value possible. By filling the underflow gap like this, significant digits are lost, but not as abruptly as when using the flush to zero on underflow approach (discarding all significant digits when underflow is reached). Hence the production of a subnormal number is sometimes called gradual underflow because it allows a calculation to lose precision slowly when the result is small. In IEEE 754-2008 , denormal numbers are renamed subnormal numbers and are supported in both binary and decimal formats. In binary interchange formats, subnormal numbers are encoded with

1595-426: Is the sign bit, x is the exponent, and m is the significand. These examples are given in bit representation , in hexadecimal and binary , of the floating-point value. This includes the sign, (biased) exponent, and significand. By default, 1/3 rounds up, instead of down like double precision , because of the even number of bits in the significand. The bits of 1/3 beyond the rounding point are 1010... which

1650-419: The double type corresponds to double precision. However, on 32-bit x86 with extended precision by default, some compilers may not conform to the C standard or the arithmetic may suffer from double rounding . Fortran provides several integer and real types, and the 64-bit type real64 , accessible via Fortran's intrinsic module iso_fortran_env , corresponds to double precision. Common Lisp provides

1705-543: The SSE2 processor extension, Intel has provided such a functionality in CPU hardware, which rounds subnormal numbers to zero. Intel's C and Fortran compilers enable the DAZ (denormals-are-zero) and FTZ (flush-to-zero) flags for SSE by default for optimization levels higher than -O0 . The effect of DAZ is to treat subnormal input arguments to floating-point operations as zero, and

Double-precision floating-point format - Misplaced Pages Continue

1760-432: The actual zero. Exponents range from −1022 to +1023 because exponents of −1023 (all 0s) and +1024 (all 1s) are reserved for special numbers. The 53-bit significand precision gives from 15 to 17 significant decimal digits precision (2 ≈ 1.11 × 10). If a decimal string with at most 15 significant digits is converted to the IEEE 754 double-precision format, giving a normal number, and then converted back to

1815-537: The double-precision number is described by: Encodings of qNaN and sNaN are not completely specified in IEEE 754 and depend on the processor. Most processors, such as the x86 family and the ARM family processors, use the most significant bit of the significand field to indicate a quiet NaN; this is what is recommended by IEEE 754. The PA-RISC processors use the bit to indicate a signaling NaN. By default, / 3 rounds down, instead of up like single precision , because of

1870-421: The effect of FTZ is to return zero instead of a subnormal float for operations that would result in a subnormal float, even if the input arguments are not themselves subnormal. clang and gcc have varying default states depending on platform and optimization level. A non- C99 -compliant method of enabling the DAZ and FTZ flags on targets supporting SSE is given below, but is not widely supported. It

1925-415: The exponent encoding below). With the 52 bits of the fraction (F) significand appearing in the memory format, the total precision is therefore 53 bits (approximately 16 decimal digits, 53 log 10 (2) ≈ 15.955). The bits are laid out as follows: [REDACTED] The real value assumed by a given 64-bit double-precision datum with a given biased exponent e {\displaystyle e} and

1980-515: The fastest instructions to run as much as six times slower. This speed difference can be a security risk. Researchers showed that it provides a timing side channel that allows a malicious web site to extract page content from another site inside a web browser. Some applications need to contain code to avoid subnormal numbers, either to maintain accuracy, or in order to avoid the performance penalty in some processors. For instance, in audio processing applications, subnormal values usually represent

2035-450: The first programming languages to provide floating-point data types was Fortran . Before the widespread adoption of IEEE 754-1985, the representation and properties of floating-point data types depended on the computer manufacturer and computer model, and upon decisions made by programming-language implementers. E.g., GW-BASIC 's double-precision data type was the 64-bit MBF floating-point format. Double-precision binary floating-point

2090-454: The first programming languages to provide single- and double-precision floating-point data types was Fortran . Before the widespread adoption of IEEE 754-1985, the representation and properties of floating-point data types depended on the computer manufacturer and computer model, and upon decisions made by programming-language designers. E.g., GW-BASIC 's single-precision data type was the 32-bit MBF floating-point format. Single precision

2145-637: The following outline: Conversion of the fractional part: Consider 0.375, the fractional part of 12.375. To convert it into a binary fraction, multiply the fraction by 2, take the integer part and repeat with the new fraction by 2 until a fraction of zero is found or until the precision limit is reached which is 23 fraction digits for IEEE 754 binary32 format. We see that ( 0.375 ) 10 {\displaystyle (0.375)_{10}} can be exactly represented in binary as ( 0.011 ) 2 {\displaystyle (0.011)_{2}} . Not all decimal fractions can be represented in

2200-446: The following. On processors with only dynamic precision, such as x86 without SSE2 (or when SSE2 is not used, for compatibility purpose) and with extended precision used by default, software may have difficulties to fulfill some requirements. C and C++ offer a wide variety of arithmetic types . Double precision is not required by the standards (except by the optional annex F of C99 , covering IEEE 754 arithmetic), but on most systems,

2255-418: The implicit 24th bit), bit 23 to bit 0, represents a value, starting at 1 and halves for each bit, as follows: The significand in this example has three bits set: bit 23, bit 22, and bit 19. We can now decode the significand by adding the values represented by these bits. Then we need to multiply with the base, 2, to the power of the exponent, to get the final result: Thus This is equivalent to: where s

Double-precision floating-point format - Misplaced Pages Continue

2310-532: The initial releases of IEEE 754 and the C language ), "denormal" is used to refer exclusively to subnormal numbers. This usage persists in various standards documents, especially when discussing hardware that is incapable of representing any other denormalized numbers, but the discussion here uses the term "subnormal" in line with the 2008 revision of IEEE 754 . In casual discussions the terms subnormal and denormal are often used interchangeably, in part because there are no denormalized IEEE binary numbers outside

2365-513: The left of the binary point) with value 1. Subnormal numbers and zeros (which are the floating-point numbers smaller in magnitude than the least positive normal number) are represented with the biased exponent value 0, giving the implicit leading bit the value 0. Thus only 23 fraction bits of the significand appear in the memory format, but the total precision is 24 bits (equivalent to log 10 (2 ) ≈ 7.225 decimal digits). The bits are laid out as follows: [REDACTED] The real value assumed by

2420-741: The odd number of bits in the significand. In more detail: Using double-precision floating-point variables is usually slower than working with their single precision counterparts. One area of computing where this is a particular issue is parallel code running on GPUs. For example, when using NVIDIA 's CUDA platform, calculations with double precision can take, depending on hardware, from 2 to 32 times as long to complete compared to those done using single precision . Additionally, many mathematical functions (e.g., sin, cos, atan2, log, exp and sqrt) need more computations to give accurate double-precision results, and are therefore slower. Doubles are implemented in many programming languages in different ways such as

2475-425: The offset-binary representation, the offset of 127 has to be subtracted from the stored exponent. The stored exponents 00 H and FF H are interpreted specially. The minimum positive normal value is 2 − 126 ≈ 1.18 × 10 − 38 {\displaystyle 2^{-126}\approx 1.18\times 10^{-38}} and the minimum positive (subnormal) value

2530-402: The only supported precision is single. The IEEE 754 standard specifies a binary32 as having: This gives from 6 to 9 significant decimal digits precision. If a decimal string with at most 6 significant digits is converted to the IEEE 754 single-precision format, giving a normal number , and then converted back to a decimal string with the same number of digits, the final result should match

2585-415: The original string. If an IEEE 754 single-precision number is converted to a decimal string with at least 9 significant digits, and then converted back to single-precision representation, the final result must match the original number. The sign bit determines the sign of the number, which is the sign of the significand as well. The exponent field is an 8-bit unsigned integer from 0 to 255, in biased form :

2640-452: The precision or range of the numbers so encoded. However, RFC 8259 advises that, since IEEE 754 binary64 numbers are widely implemented, good interoperability can be achieved by implementations processing JSON if they expect no more precision or range than binary64 offers. Rust and Zig have the f64 data type. Single-precision floating-point format Single-precision floating-point format (sometimes called FP32 or float32 )

2695-800: The resulting 32-bit IEEE 754 binary32 format representation of 12.375: Note: consider converting 68.123 into IEEE 754 binary32 format: Using the above procedure you expect to get ( 42883EF9 ) 16 {\displaystyle ({\text{42883EF9}})_{16}} with the last 4 bits being 1001. However, due to the default rounding behaviour of IEEE 754 format, what you get is ( 42883EFA ) 16 {\displaystyle ({\text{42883EFA}})_{16}} , whose last 4 bits are 1010. Example 1: Consider decimal 1. We can see that: ( 1 ) 10 = ( 1.0 ) 2 × 2 0 {\displaystyle (1)_{10}=(1.0)_{2}\times 2^{0}} From which we deduce: From these we can form

2750-424: The resulting 32-bit IEEE 754 binary32 format representation of real number 0.375: If the binary32 value, 41C80000 in this example, is in hexadecimal we first convert it to binary: then we break it down into three parts: sign bit, exponent, and significand. We then add the implicit 24th bit to the significand: and decode the exponent value by subtracting 127: Each of the 24 bits of the significand (including

2805-473: The resulting 32-bit IEEE 754 binary32 format representation of real number 1: Example 2: Consider a value 0.25. We can see that: ( 0.25 ) 10 = ( 1.0 ) 2 × 2 − 2 {\displaystyle (0.25)_{10}=(1.0)_{2}\times 2^{-2}} From which we deduce: From these we can form the resulting 32-bit IEEE 754 binary32 format representation of real number 0.25: Example 3: Consider

SECTION 50

#1733202198304

2860-431: The subnormal range. The term "number" is used rather loosely, to describe a particular sequence of digits, rather than a mathematical abstraction; see Floating Point for details of how real numbers relate to floating point representations. "Representation" rather than "number" may be used when clarity is required. Mathematical real numbers may be approximated by multiple floating point representations. One representation

2915-691: The types SHORT-FLOAT, SINGLE-FLOAT, DOUBLE-FLOAT and LONG-FLOAT. Most implementations provide SINGLE-FLOATs and DOUBLE-FLOATs with the other types appropriate synonyms. Common Lisp provides exceptions for catching floating-point underflows and overflows, and the inexact floating-point exception, as per IEEE 754. No infinities and NaNs are described in the ANSI standard, however, several implementations do provide these as extensions. On Java before version 1.2, every implementation had to be IEEE 754 compliant. Version 1.2 allowed implementations to bring extra precision in intermediate computations for platforms like x87 . Thus

2970-518: The values are not equal. This can, in turn, lead to division by zero errors that cannot occur when gradual underflow is used. Subnormal numbers were implemented in the Intel 8087 while the IEEE 754 standard was being written. They were by far the most controversial feature in the K-C-S format proposal that was eventually adopted, but this implementation demonstrated that subnormal numbers could be supported in

3025-424: The zero offset being 1023; also known as exponent bias in the IEEE 754 standard. Examples of such representations would be: The exponents 000 16 and 7ff 16 have a special meaning: where F is the fractional part of the significand . All bit patterns are valid encoding. Except for the above exceptions, the entire double-precision number is described by: In the case of subnormal numbers ( e = 0)

#303696