Talk:Floating-point arithmetic

Article snapshot taken from Wikipedia with Creative Commons Attribution-ShareAlike license.

This is an old revision of this page, as edited by Ideogram (talk | contribs) at 23:06, 27 February 2012 (assess). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.

This is the talk page for discussing improvements to the Floating-point arithmetic article.
This is not a forum for general discussion of the article's subject.
WikiProject Computing: CompSci, Start-class, Top-importance
This article is within the scope of WikiProject Computing, a collaborative effort to improve the coverage of computers, computing, and information technology on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.
This article has been rated as Start-class on Wikipedia's content assessment scale.
This article has been rated as Top-importance on the project's importance scale.
This article is supported by WikiProject Computer science (assessed as Top-importance).

This page has archives. Sections older than 90 days may be automatically archived by Lowercase sigmabot III when more than 4 sections are present.

Software of book side of history

I just reverted a bit about the Pilot Ace in history because it used software to emulate floating point. However, it occurs to me that there might be something worthwhile in the bit about J. H. Wilkinson, Rounding errors in algebraic processes. Is there evidence about who wrote a book about floating point or that this was a particular turning point? Dmcq (talk) 18:46, 6 February 2012 (UTC)

IEEE 754

I have added a section discussing the "big picture" on the rationale and use for the IEEE 754 features, which often gets lost when discussing the details. I plan to add specific references for the points made there (from Kahan's web site). It would be good to expand the examples and add additional ones as well.

Brianbjparker (talk) —Preceding undated comment added 11:22, 19 February 2012 (UTC).

You need to cite something saying these were accepted rationales for it. Citations point to specific books, journals, or newspapers, and preferably page number ranges. Dmcq (talk) 13:51, 19 February 2012 (UTC)

Added direct citations as requested. Brianbjparker (talk) —Preceding undated comment added 18:20, 19 February 2012 (UTC).

Thanks. My feeling about Kahan and his diatribe against Java is that he just doesn't get what programmers have to do when testing a program. Having a switch to enable lax typing of intermediate results where you know it will only be run in environments you've tested is a good idea, but that wasn't what Java was originally designed for. The section about extended precision there seems undue in length, as I'm pretty certain other considerations like signed zero and denormal handling were the main original considerations where it differed from previous implementations. Dmcq (talk) 20:37, 19 February 2012 (UTC)
Although I referenced Kahan's Java paper several times, I certainly didn't want this section to appear as a slight against Java. Kahan has several other papers discussing the need for extended precision that do not mention Java-- I will replace the current references with those in the near future, and try to trim it down (although I don't think that reference is a diatribe against Java, just against its numerics). I certainly didn't want to get into the tradeoffs between improved numerical precision of results versus exact reproducibility in Java in this section. I do however think that it is important to clarify the intended use of the IEEE 754 features in an introductory article like this, which can get lost in detailed descriptions of the features. In particular, I find that there is *wide* misunderstanding of the intended use of, and need for, extended precision amongst the programming community, particularly as extended precision was historically not supported in several RISC processors, and thus it is underused by programmers, even when targeting the x86 platform for e.g. HPC (even when these same programmers would carry additional significant figures for intermediate calculations if doing the same computations by hand, as alluded to in this section). Also, Kahan's descriptions of work on the design of the x87 (based on his experience designing HP calculators which use extended precision internally) make it clear that extended precision was intended as a key feature (indeed a recommended feature) of IEEE 754, compared with previous implementations.

Brianbjparker (talk) 00:56, 20 February 2012 (UTC)

As far as I'm aware the main other rationales were
To have a sound mathematical basis in that results were correctly rounded versions of accurate results and also so reasoning about the calculations would be easier.
Round to even was used to improve accuracy. In fact this is much more important than extended precision if the double storage mode is only used for intermediate calculations. Using extended precision only gives about one extra bit overall at the end if values in arrays are in doubles. The main reason I believe they were put in was it made calculating mathematical functions much easier and more accurate; they can also be used in inner routines with benefit.
Biased rounding was put in I believe to support interval arithmetic - another part of being able to guarantee the results of calculations. Dmcq (talk) 15:43, 20 February 2012 (UTC)
"Using extended precision only gives about one extra bit overall at the end if values in arrays are in doubles." This is false in general; you must be thinking of some special cases where not many intermediate calculations happen before rounding to double for storage. For a counterexample, e.g. consider a loop to take a dot product of two double-precision arrays (not using Kahan summation etc.) — Steven G. Johnson (talk) 21:16, 20 February 2012 (UTC)
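The dot-product case can be tried out one precision level down, which is only an analogy: the sketch below assumes single-precision storage with double-precision accumulation standing in for double storage with extended accumulation, and it simulates single precision by rounding Python floats through `struct`. The helper `to_single` and the data are hypothetical, and the result does not by itself settle the x87 question debated here.

```python
import struct
from fractions import Fraction

def to_single(x):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]

n = 100_000
a = [to_single(0.1)] * n   # hypothetical data, stored in single precision
b = [1.0] * n

# Accumulate in the storage precision: round after every multiply and add.
narrow = 0.0
for x, y in zip(a, b):
    narrow = to_single(narrow + to_single(x * y))

# Accumulate in the wider format and round once when storing the result.
wide = 0.0
for x, y in zip(a, b):
    wide += x * y
wide = to_single(wide)

# Exact reference computed with rational arithmetic.
exact = sum(Fraction(x) * Fraction(y) for x, y in zip(a, b))
err_narrow = abs(Fraction(narrow) - exact)
err_wide = abs(Fraction(wide) - exact)
print(float(err_narrow), float(err_wide))
```

With this particular data the narrow accumulation drifts by far more than one unit in the last place of the stored result, while the wide accumulation stays within half a unit. Other data can behave differently.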
You would normally get very little advantage in that case over round to even with so few intermediate calculations. And for longer calculations round to even wins over just using a longer mantissa and rounding down. You only get a worthwhile gain if the storage is in extended precision. Dmcq (talk) 21:53, 20 February 2012 (UTC)
That is certainly not the case in general. The examples you are thinking of are using simple exactly rounded single arithmetic expressions-- the advantage of extended precision is avoiding loss of precision in more complicated numerically unstable formulae-- e.g. it is easy to construct examples where even computing a quadratic formula discriminant can cause massive loss of ULP when computed in double but not in double extended. Several examples are given in the Kahan references. This is in addition to the advantage of the extended exponent in avoiding overflow in e.g. dot products. Brianbjparker (talk) 00:16, 22 February 2012 (UTC)
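A constructed discriminant case, again shifted one level down as an assumption (simulated single precision as the working format, double as the wider one), might look like the sketch below. The coefficients are hypothetical and chosen so that b*b falls exactly halfway between two single-precision values; `to_single` is an illustrative helper, not part of any library.

```python
import struct

def to_single(x):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]

# Hypothetical coefficients, all exactly representable in single precision.
a = 1.0
b = 2.0 + 2.0**-11
c = 1.0 + 2.0**-11

# Every operation rounded to single: b*b loses its last bit before the subtraction.
disc_narrow = to_single(to_single(b * b) - to_single(to_single(4.0 * a) * c))

# Same formula with double intermediates: b*b and 4*a*c are exact here.
disc_wide = b * b - 4.0 * a * c

print(disc_narrow, disc_wide)
```

Here the narrow computation reports a zero discriminant (a repeated root), while the wide one recovers the true value 2**-22, which is itself representable in single precision.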
When you say "Round to even was used to improve accuracy", I take it you are mainly referring to the exact rounding: breaking ties by round to even does avoid some additional statistical biases but it is rather subtle (might be worth mentioning in the main text though). Brianbjparker (talk) 00:16, 22 February 2012 (UTC)
"Biased rounding was put in I believe to support interval arithmetic." Yes, I believe directed rounding was included to support interval arithmetic, but also for debugging numerical stability issues-- if an algorithm gives drastically different results under round to + and - infinity then it is likely unstable. Brianbjparker (talk) 00:16, 22 February 2012 (UTC)
"As far as I'm aware the main other rationales were... to have a sound mathematical basis in that results were correctly rounded versions of accurate results and also so reasoning about the calculations would be easier." Yes, the exact rounding is an important point-- I have added some additional text earlier in the article to expand on this. It is true that, like previous arithmetics, having a precise specification to allow expert numerical analysts to write robust libraries was an important consideration, but the unique aspect of IEEE-754 is that it was also aimed at a broad market of non-expert users and so I focused in the section on the robustness features relevant to that (I will add some text highlighting that aspect as well though). Brianbjparker (talk) 00:16, 22 February 2012 (UTC)
Well, exact rounding, but I thought it better to specify the precise format they have. The point is that rounding rather than truncating is what really matters. With rounding the error only tends to go up as the square root of the number of operations, whereas with directed rounding it goes up linearly. Even the reduction of bias by round to even matters in this. You always get something else putting in a little bias so it is not as good as this, but directed rounding is really bad. You're better off just perturbing the original figures for stability checking.
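The growth-rate claim can be checked roughly with a running sum. The sketch below is an assumption-laden stand-in: it uses Python's `decimal` module at 6 significant digits rather than binary hardware arithmetic, hypothetical random inputs with a fixed seed, and `ROUND_FLOOR` as the directed mode.

```python
import random
from decimal import Context, Decimal, ROUND_FLOOR, ROUND_HALF_EVEN
from fractions import Fraction

random.seed(0)
# Hypothetical inputs: 10,000 values in [0, 1), converted exactly to Decimal.
values = [Decimal(random.random()) for _ in range(10_000)]

def running_sum(ctx):
    """Add the values one by one, rounding each partial sum under ctx."""
    total = Decimal(0)
    for v in values:
        total = ctx.add(total, v)
    return total

nearest = running_sum(Context(prec=6, rounding=ROUND_HALF_EVEN))
directed = running_sum(Context(prec=6, rounding=ROUND_FLOOR))

exact = sum(Fraction(v) for v in values)
err_nearest = abs(Fraction(nearest) - exact)
err_directed = abs(Fraction(directed) - exact)
print(float(err_nearest), float(err_directed))
```

Under the directed mode every rounding error has the same sign, so the errors add up; under round-half-even they largely cancel.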
The mathematical basis makes it much easier to do things like construct longer precision arithmetic packages easily, in fact the fused multiply is particularly useful for this. Dmcq (talk) 00:27, 22 February 2012 (UTC)
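One building block of such multiword packages is an error-free addition, commonly called TwoSum. The sketch below assumes Python floats are IEEE 754 binary64 with round-to-nearest and no overflow. A fused multiply-add gives the analogous exact error term for products, which is not shown here.

```python
from fractions import Fraction

def two_sum(a, b):
    """Return (s, err) with s = fl(a + b) and s + err == a + b exactly."""
    s = a + b
    b_virtual = s - a          # the part of b that actually made it into s
    a_virtual = s - b_virtual  # the part of a that actually made it into s
    err = (a - a_virtual) + (b - b_virtual)
    return s, err

s, err = two_sum(1.0, 1e-17)
print(s, err)  # the small addend is lost from s but recovered in err

# Check exactness with rational arithmetic on another pair.
s2, err2 = two_sum(0.1, 0.2)
exact_ok = Fraction(s2) + Fraction(err2) == Fraction(0.1) + Fraction(0.2)
```

A pair (s, err) of doubles then behaves like a number with roughly twice the precision, which is the basis of double-double arithmetic.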
The use of directed rounding for diagnosis of stability issues is discussed here http://www.cs.berkeley.edu/~wkahan/Stnfrd50.pdf and in other references at that web site. It also discusses why perturbation alone is not as useful. IEEE 754-2008 annex B states this explicitly-- "B.2 Numerical sensitivity: Debuggers should be able to alter the attributes governing handling of rounding or exceptions inside subprograms, even if the source code for those subprograms is not available; dynamic modes might be used for this purpose. For instance, changing the rounding direction or precision during execution might help identify subprograms that are unusually sensitive to rounding, whether due to ill-condition of the problem being solved, instability in the algorithm chosen, or an algorithm designed to work in only one rounding- direction attribute. The ultimate goal is to determine responsibility for numerical misbehavior, especially in separately-compiled subprograms. The chosen means to achieve this ultimate goal is to facilitate the production of small reproducible test cases that elicit unexpected behavior." Brianbjparker (talk) 01:04, 22 February 2012 (UTC)
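The diagnostic idea quoted from annex B can be sketched by running the same routine under two rounding directions and comparing the results. The example below is only an illustration under stated assumptions: it uses Python's `decimal` contexts at 8 significant digits instead of hardware rounding modes, and the data and the helper `under` are hypothetical.

```python
from decimal import Context, Decimal, ROUND_CEILING, ROUND_FLOOR, localcontext

# Hypothetical samples whose true sample variance is exactly 0.01.
data = [Decimal("10000.1"), Decimal("10000.2"), Decimal("10000.3")]

def naive_variance(xs):
    """One-pass formula; subtracts two large, nearly equal quantities."""
    n = len(xs)
    total = sum(xs, Decimal(0))
    total_sq = sum((x * x for x in xs), Decimal(0))
    return (total_sq - total * total / n) / (n - 1)

def two_pass_variance(xs):
    """Two-pass formula; subtracts the mean before squaring."""
    n = len(xs)
    mean = sum(xs, Decimal(0)) / n
    return sum(((x - mean) * (x - mean) for x in xs), Decimal(0)) / (n - 1)

def under(rounding, func):
    """Evaluate func(data) with 8 significant digits and the given rounding."""
    with localcontext(Context(prec=8, rounding=rounding)):
        return func(data)

for func in (naive_variance, two_pass_variance):
    print(func.__name__, under(ROUND_FLOOR, func), under(ROUND_CEILING, func))
```

The naive formula swings between 0 and 10 depending on the rounding direction, which flags it as sensitive to rounding, while the two-pass formula gives 0.01 either way.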
The uses that somebody makes of features is quite a different thing from the rationale for why somebody would pay to have them implemented. The introduction to the standard gives a succinct summary of the main reasons for the standard. I'll just copy the latest here so you can see
a) Facilitate movement of existing programs from diverse computers to those that adhere to this standard as well as among those that adhere to this standard.
b) Enhance the capabilities and safety available to users and programmers who, although not expert in numerical methods, might well be attempting to produce numerically sophisticated programs.
c) Encourage experts to develop and distribute robust and efficient numerical programs that are portable, by way of minor editing and recompilation, onto any computer that conforms to this standard and possesses adequate capacity. Together with language controls it should be possible to write programs that produce identical results on all conforming systems.
d) Provide direct support for
― execution-time diagnosis of anomalies
― smoother handling of exceptions
― interval arithmetic at a reasonable cost.
e) Provide for development of
― standard elementary functions such as exp and cos
― high precision (multiword) arithmetic
― coupled numerical and symbolic algebraic computation.
f) Enable rather than preclude further refinements and extensions.
There are other things but this is what the basic rationale was and is. Directed rounding was for interval arithmetic. Dmcq (talk) 01:56, 22 February 2012 (UTC)
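For readers unfamiliar with item d), interval arithmetic rounds lower bounds down and upper bounds up so that the true result is always enclosed. A minimal sketch follows, assuming `decimal` contexts at 6 significant digits in place of hardware rounding modes; `interval_div` and `interval_add` are hypothetical helper names.

```python
from decimal import Context, Decimal, ROUND_CEILING, ROUND_FLOOR

DOWN = Context(prec=6, rounding=ROUND_FLOOR)    # used for lower bounds
UP = Context(prec=6, rounding=ROUND_CEILING)    # used for upper bounds

def interval_div(a, b):
    """Enclose a / b, for positive exact Decimals a and b."""
    return DOWN.divide(a, b), UP.divide(a, b)

def interval_add(x, y):
    """Add two intervals given as (lower, upper) pairs."""
    return DOWN.add(x[0], y[0]), UP.add(x[1], y[1])

third = interval_div(Decimal(1), Decimal(3))
lo, hi = interval_add(interval_add(third, third), third)
print(lo, hi)  # an enclosure of 1/3 + 1/3 + 1/3
```

The computed interval is wider than necessary but is guaranteed to contain the exact value 1.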
Thanks. Actually, I believe that "d) Provide direct support for― execution-time diagnosis of anomalies" is referring to this use of directed rounding to diagnose numerical instability. Certainly Kahan makes it clear that he considered it a key usage from the early design of the x87. I agree that its use for interval arithmetic was also considered from the beginning. Brianbjparker (talk) 02:11, 22 February 2012 (UTC)
No that refers to identification and methods of notifying the various exceptions and the handling of the signalling and quiet NaNs. Your reference from 2007 does not support in any way that arbitrarily jiggling the calculations using directed rounding was considered as a reason to include directed rounding in the specification. He'd have been just laughed at if he had justified spending money on the 8087 for such a purpose when there are easy ways of doing something like that without any hardware assistance. Dmcq (talk) 08:23, 22 February 2012 (UTC)

Trivia removed

I removed the bit about the full precision of extended precision being attained when extended precision is used. The point about the algorithm is that it converges using the precision used. We don't need to put in the precisions of the single, double, and extended precision versions of the algorithm. Dmcq (talk) 23:23, 23 February 2012 (UTC)

I disagree that it is trivia-- it is a good example to also illustrate the earlier discussions on the usage of extended precision. In any case, to make it easier to find for those who may be interested in the information: the footnote to the final example, giving the precision using double extended for internal calculations, is included here-
"As the recurrence is applied repeatedly, the accuracy improves at first, but then it deteriorates. It never gets better than about 8 digits, even though 53-bit arithmetic should be capable of about 16 digits of precision. When the second form of the recurrence is used, the value converges to 15 digits of precision. Footnote: if intermediate calculations are carried at a higher precision using double extended (x87 80 bit) format, it reaches 18 digits of precision, which is the full target double precision." Brianbjparker (talk) 23:37, 23 February 2012 (UTC)
It just has nothing to do with extended precision. The first algorithm would go wrong just as badly with extended precision and the second one behaves exactly like double. There is nothing of note here. Why should it have all the various precisions in? The same thing would happen with float or quad precision. All it says is that the precision for different precisions is different. Also, a double cannot hold 18 digits of precision; used as an intermediate for double you'd at most get one bit of precision extra. Dmcq (talk) 00:50, 25 February 2012 (UTC)
Agreed that the footnote does nothing to clarify the particular point being made by that example-- that wasn't the aim though. The intention was to also utilise the example to demonstrate the utility of computing intermediate values to higher precision than needed by the final destination format to limit the effects of round-off. In that sense it is an example for the earlier discussion on extended precision (and also the section of approaches to improve accuracy). Perhaps the text "Footnote: if intermediate calculations are carried at a higher precision using double extended (x87 80 bit) format, it reaches 18 digits of precision, which is the full target double precision (see discussion on extended precision above)." would be clearer. Agreed it is not the most striking example of this, but still demonstrates the idea-- perhaps a separate, more striking and specific example would be preferable, I will see what I can find. Brianbjparker (talk) 04:52, 25 February 2012 (UTC)
It does not illustrate that. What gives you the idea it does? If anything it is an argument against what was said before. Using extended precision in the intermediate calculation and storing back as double does not give increased precision in the final result. The 18 digits only applies to the extended precision, it does not apply to the double result. The 18 digits is not the target precision of a double. A double can only hold 15 digits accurately. There is no way to stick the extra precision of the extended precision into the target double. Dmcq (talk) 09:53, 25 February 2012 (UTC)
IEEE 754 double precision gives from 15 to 17 decimal digits of precision (17 digits if round-tripping from double to text back to double). When the example is computed with extended precision it gives 17 decimal digits of precision, so if the returned double was to be used for further computation it would have less roundoff error, in ULP (at least one extra decimal digit worth). Although, as you say, if the double result is printed to 15 decimal digits this extra precision will be lost. I agree that it is not a compelling example-- a better example could show a difference in many decimal significant digits due to internal extended precision. 121.45.205.130 (talk) 23:21, 25 February 2012 (UTC)
The 17 digits for a round trip is only needed to cope with making certain that rounding works okay. The actual precision is just less than 16 digits, about 15.95 if one cranks the figures. Printing has nothing to do with it. I was just talking about the 53 bits of precision information held within double precision format expressed as decimal digits. You can't shove any more information into the bits. The value there is about 1 ulp out and using extended precision would gain that back. This is what I was saying about extended precision being very useful for getting accurate maths functions, straightforward implementations in double will very often be 1 ulp out without special work whereas the extended precision result will very often give the value given by rounding the exact value. Dmcq (talk) 00:08, 26 February 2012 (UTC)
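The figures in this exchange can be reproduced directly. The sketch below assumes Python floats are IEEE 754 binary64; the sample value `0.1 + 0.2` is a hypothetical choice for illustration.

```python
import math

# 53 significand bits expressed as decimal digits.
digits = 53 * math.log10(2)
print(digits)  # just under 16

# A double whose 15-digit decimal form does not read back as the same double.
x = 0.1 + 0.2
round_trip_15 = float(format(x, ".15g")) == x
round_trip_17 = float(format(x, ".17g")) == x
print(format(x, ".15g"), round_trip_15)
print(format(x, ".17g"), round_trip_17)
```

So 17 significant digits are enough to identify a double uniquely when converting to text and back, even though the format carries slightly less than 16 digits of information.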
Ideally, what should be added is a more striking example of using excess precision in intermediate computations to protect against numerical instability. The current one can indeed demonstrate this if excess precision is carried to IEEE quad precision, in which case the numerically unstable version gives good results. I have added notes to that effect which will do as an example for now. There are many examples also showing this using only double extended (e.g. even as simple as computing the roots of a quadratic equation), and I will add such an example in the future, but not for a while (by the way, I think double extended adds more than 1 ULP but I haven't checked that). Brianbjparker (talk) 06:54, 26 February 2012 (UTC)
That's not true either because how does one know when to stop? Using quadruple precision would still diverge. Dmcq (talk) 11:45, 26 February 2012 (UTC)
Yes that is so- once it does reach the correct value it stays there for several iterations (at double precision) but does eventually diverge from it again, so a stopping criterion of when the value does not change at double precision could be used. But yes, I am not completely happy with that example for that reason-- feel free to remove it if you feel it is misleading. Actually Kahan has several very compelling examples in his notes-- I will post one here in the next week or so. Brianbjparker (talk) 14:41, 26 February 2012 (UTC)

The use of extra precision can be illustrated easily using differentiation. If the result is to be single precision then using double precision for all the calculations is a good idea because of the loss of significance when subtracting two values of the function. Dmcq (talk) 12:00, 26 February 2012 (UTC)

ok yes, that could be a good example-- I will see what I can come up with. Brianbjparker (talk) 14:41, 26 February 2012 (UTC)
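The differentiation suggestion above could be sketched as follows. This is a minimal illustration under stated assumptions: single precision is simulated by rounding Python floats through `struct`, the function x**3, the point x = 1 and the step size are hypothetical choices, and a central difference is used.

```python
import struct

def to_single(x):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def cube_single(x):
    """x**3 with every operation rounded to single precision."""
    return to_single(to_single(x * x) * x)

def cube_double(x):
    """x**3 in ordinary double precision."""
    return x * x * x

x = 1.0
h = to_single(1e-4)  # step size, stored in single precision

# Everything in single: the subtraction cancels most of the significant bits.
upper = cube_single(to_single(x + h))
lower = cube_single(to_single(x - h))
d_single = to_single(to_single(upper - lower) / to_single(2.0 * h))

# Double intermediates, rounded to single only at the end.
d_mixed = to_single((cube_double(x + h) - cube_double(x - h)) / (2.0 * h))

print(d_single, d_mixed)  # the exact derivative is 3
```

With these choices the all-single result is wrong from about the fourth significant digit onward, while the mixed computation rounds to the exact derivative.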

01010111 01101000 01100001 01110100 00101110 00101110 00101110 00111111 (What...?)

The section on internal representation does not explain how decimals are converted to floating-point values. I think it will be helpful if we add a step-by-step procedure that the computer follows. Thanks! 68.173.113.106 (talk) 02:16, 25 February 2012 (UTC)

This gives an example of conversion and the articles on the particular formats give other examples. Wikipedia does not in general provide step-by-step procedures; it describes things, see WP:NOTHOWTO. Dmcq (talk) 02:24, 25 February 2012 (UTC)
I just thought it was kind of unclear. Besides, doing so might actually help this article get to GA status.
You see, I'm trying to design an algorithm for getting the mantissa, the exponent, and the sign of a float or double. So in case anyone else actually cares about that stuff. For the record, the storage is little-endian, so you have to reverse the bit order. 68.173.113.106 (talk) 02:50, 25 February 2012 (UTC)
It would stop FA status. Have a look at the articles about the individual formats. They describe in quite enough details the format. Any particular algorithm is up to the user, they are not interesting or discussed in secondary sources. Dmcq (talk) 10:01, 25 February 2012 (UTC)
The closest in Wikipedia for the sort of stuff you're talking about is if somebody wrote something for wikibooks. Have you had a look at the various external sites? Really, to me what you're talking about sounds like some homework exercise and we shouldn't help with those except perhaps to give hints. Dmcq (talk) 10:20, 25 February 2012 (UTC)
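For anyone landing here with the same question, the field extraction asked about above can be sketched in a few lines. The sketch assumes the IEEE 754 binary64 layout (1 sign bit, 11 exponent bits, 52 fraction bits), and `decompose` is a hypothetical name. Packing with an explicit big-endian format sidesteps the endianness issue, which concerns the order of bytes in memory rather than the order of bits.

```python
import struct

def decompose(x):
    """Split a double into (sign, biased exponent, fraction field)."""
    # ">d" packs the value as 8 bytes, most significant byte first.
    bits = int.from_bytes(struct.pack(">d", x), "big")
    sign = bits >> 63
    exponent = (bits >> 52) & 0x7FF       # biased by 1023
    fraction = bits & ((1 << 52) - 1)     # without the implicit leading 1
    return sign, exponent, fraction

print(decompose(1.0))
print(decompose(-2.0))
print(decompose(0.1))
```

For a normal number the value is (-1)**sign * (1 + fraction / 2**52) * 2**(exponent - 1023). Zeros, subnormals, infinities and NaNs use the reserved exponent values 0 and 2047 and need separate handling.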

imho, "real numbers" is didactically misleading

I'd like to propose to change the beginning of the first sentence, because the limited amount of bits in the significand only allows for storing rational binary numbers. Because two is a prime factor of ten, this means only rational decimal numbers can be stored as well. Concluding, I'd like to propose to replace "real" by "rational" there. Drgst (talk) 13:17, 25 February 2012 (UTC)

Definitely not. That is a bad idea. They are approximations to real numbers. The concept of rational number just doesn't come into it. That they are rational is just a side effect. Dmcq (talk) 14:32, 25 February 2012 (UTC)
In the section 'Some other computer representations for non-integral numbers' there are some systems that can represent some irrational numbers. For instance, a logarithmic system does not necessarily represent rational numbers. Dmcq (talk) 14:36, 25 February 2012 (UTC)
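Both views in this thread can be seen in a short check, assuming Python floats are IEEE 754 binary64: the stored value is a rational number with a power-of-two denominator and a finite decimal expansion, yet it is only an approximation to the real number that was written.

```python
from decimal import Decimal
from fractions import Fraction

num, den = (0.1).as_integer_ratio()
print(num, den)        # the exact rational value held by the double nearest to 0.1
print(Decimal(0.1))    # the same value as a finite decimal expansion

is_power_of_two = den & (den - 1) == 0
is_exactly_one_tenth = Fraction(num, den) == Fraction(1, 10)
print(is_power_of_two, is_exactly_one_tenth)
```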