Java floating-point number intricacies – J010
DeegeU Java Course
This video “Java floating-point number intricacies” is part of a larger free online class called “Free Java Course Online”. You can find more information about this class on the “Free Java Course Online” syllabus.
Try at home
- Type in a program with different floating point number types
- Try setting your floats to NaN or Infinity
- Try calculating the bit representation for different numbers
- Try creating errors in your program and compiling it
Transcript – Java floating-point number intricacies
So what happens when we need to represent numbers with fractions? So far we’ve covered integer types only. For real numbers, Java gives us two data types, again in different sizes: float and double. Float is short for floating point, and double just means double sized. These primitives are used to represent real numbers.
In this lesson we’ll look at the Java float and Java double. These primitives are really the IEEE 754 floating point numbers. We’ll look at how they are stored differently. And finally we’ll look at why they are not the best choice for fractional decimal numbers.
Before we dive into the floating point numbers, let’s review. We can represent numbers as binary numbers. They follow the same pattern as decimal numbers. Instead of each place being a power of 10, it’s a power of 2. So instead of 1, 10, 100, 1000, etc, each bit is 1, 2, 4, 8, 16, 32, and so on. We covered this in the lesson how do computers store numbers. Binary numbers work the same way for fractions.
Just as point 1 is really 1 over 10, the first fractional bit for a binary number is 1 over 2. Then it’s 1 over 4, 1 over 8 and so on. So if we took 1.75, in binary it would be 1.11. The first digit is 1 over 2 and the second is 1 over 4. One over 2, plus one over 4, is 3 over 4, or .75.
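If you want to check the binary fraction arithmetic yourself, here’s a minimal sketch. The class name BinaryFraction and the parse method are made up for this example; they are not part of the course code.

```java
class BinaryFraction {
    // Converts a binary string such as "1.11" to its numeric value.
    static double parse(String binary) {
        int point = binary.indexOf('.');
        String whole = point < 0 ? binary : binary.substring(0, point);
        String fraction = point < 0 ? "" : binary.substring(point + 1);

        double value = 0;
        // Whole part: each step shifts left by one binary place.
        for (char bit : whole.toCharArray()) {
            value = value * 2 + (bit - '0');
        }
        // Fractional part: the places are worth 1/2, 1/4, 1/8, and so on.
        double weight = 0.5;
        for (char bit : fraction.toCharArray()) {
            value += (bit - '0') * weight;
            weight /= 2;
        }
        return value;
    }

    public static void main(String[] args) {
        System.out.println(parse("0.1"));   // 0.5
        System.out.println(parse("0.01"));  // 0.25
        System.out.println(parse("1.11"));  // 1.75
    }
}
```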
If you remember scientific notation, that works too. Take the decimal number 101.3, that’s 1.013 x 10^2. So if we had the binary number 101.11, that is represented as 1.0111 x 2^2. You’re probably asking what does this mean for floating point numbers?
The float can represent numbers from well… this one… to that one. The numbers can be positive or negative. The bits are ordered a different way than they were for integers. It’s got three parts, the sign, the exponent and the mantissa. The first bit says if the number is positive or negative. That’s the sign bit. Positive or negative. The mantissa is our number in 23 bits, normalized as a scientific notation number. The exponent is 8 bits, and it’s like the exponent in our binary scientific notation, but the exponent is biased.
The word biased is an engineering thing. What it means in this case is we’re shifting the value by 127. So if the exponent is 3, what really gets stored is 127 + 3, or 130. Why would you want to do that? Well we want negative exponents too, and with the bias they are stored as values below 127. This way we can represent fractional numbers with no whole part.
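Here’s a rough sketch of how you could peek at the sign and the biased exponent from Java. It relies on Float.floatToIntBits, which is a real method on the Float class. The class and method names (FloatFields, storedExponent, and so on) are just picked for this example.

```java
class FloatFields {
    static final int BIAS = 127;

    // Top bit: 0 for positive, 1 for negative.
    static int sign(float f) {
        return Float.floatToIntBits(f) >>> 31;
    }

    // Next 8 bits: the exponent as stored, with the bias already added.
    static int storedExponent(float f) {
        return (Float.floatToIntBits(f) >>> 23) & 0xFF;
    }

    // Subtract the bias to get back the real power of two.
    static int actualExponent(float f) {
        return storedExponent(f) - BIAS;
    }

    public static void main(String[] args) {
        System.out.println(storedExponent(8.0f));   // 130, because 8 is 1.0 x 2^3
        System.out.println(actualExponent(8.0f));   // 3
        System.out.println(storedExponent(0.25f));  // 125, because 0.25 is 1.0 x 2^-2
        System.out.println(actualExponent(0.25f));  // -2
        System.out.println(sign(-8.0f));            // 1
    }
}
```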
A couple of other numbers you can represent. If everything is zero, then the number is zero. The sign bit doesn’t really matter. If the sign bit is set, but everything else is a zero, that’s still zero. Sometimes called negative zero.
If all the exponent bits are set, but the mantissa bits are all zero, that’s a special number. That’s infinity. The sign bit tells us if it’s positive or negative infinity.
If all the exponent bits are set, and at least one mantissa bit is set, that’s another special number called a NaN. It stands for “not a number”. It doesn’t matter which mantissa bit or how many are set. So you can set your floating point number to a regular number, zero, infinity, or not a number.
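To see those special bit patterns in action, here’s a small sketch that builds floats straight from raw bits using Float.intBitsToFloat. The class name SpecialBitPatterns and the helper fromBits are made up for illustration.

```java
class SpecialBitPatterns {
    // Reinterprets a 32 bit pattern as a float.
    static float fromBits(int bits) {
        return Float.intBitsToFloat(bits);
    }

    public static void main(String[] args) {
        System.out.println(fromBits(0x00000000)); // 0.0, everything is zero
        System.out.println(fromBits(0x80000000)); // -0.0, only the sign bit is set
        System.out.println(fromBits(0x7F800000)); // Infinity, exponent all ones
        System.out.println(fromBits(0xFF800000)); // -Infinity, same with the sign bit
        System.out.println(fromBits(0x7F800001)); // NaN, one mantissa bit set
        System.out.println(fromBits(0x7FC00000)); // NaN, a different mantissa bit set

        // Negative zero still compares equal to zero.
        System.out.println(fromBits(0x80000000) == 0.0f); // true
    }
}
```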
So let’s represent 16.125 using a float. A float is a 32 bit number. We can represent 16 with 10000. The .125 is 1 over 8, so the fractional part is .001. That makes the binary number 10000.001. To represent that in our scientific notation, we have 1.0000001 x 2^4.
Now we have to place this all in our floating point bits. The sign is positive, so that’s 0. So 0 goes into bit 32. The exponent is 4. We need to bias the number, so that makes our exponent 127 + 4, or 131. We store 131 into bits 24-31. That is 10000011.
The last part is the mantissa. Our scientific binary number is 1.0000001. We know the first digit is a one, so we don’t store that. We just store the 0000001 for the mantissa. The rest of the digits are zero. So our final number is 01000001100000010000000000000000. That’s what gets stored into a float.
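You can check that bit pattern from Java instead of doing it by hand. This is a minimal sketch, and the class name FloatBits and the method bits are hypothetical. It uses Float.floatToIntBits and Integer.toBinaryString from the standard library.

```java
class FloatBits {
    // Returns all 32 bits of a float as a string of 0s and 1s.
    static String bits(float f) {
        String raw = Integer.toBinaryString(Float.floatToIntBits(f));
        // toBinaryString drops leading zeros, so pad back out to 32 digits.
        return String.format("%32s", raw).replace(' ', '0');
    }

    public static void main(String[] args) {
        String b = bits(16.125f);
        System.out.println(b);                  // 01000001100000010000000000000000
        System.out.println(b.substring(0, 1));  // sign:     0
        System.out.println(b.substring(1, 9));  // exponent: 10000011 (131)
        System.out.println(b.substring(9));     // mantissa: 00000010000000000000000
    }
}
```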
A double is just a bigger float. It uses 1 bit for the sign, 11 bits for the exponent, and 53 bits for the mantissa. It gives you more precision, and bigger values.
Those of you paying attention at home are thinking 1 + 11 + 53 is 65. It’s a 64 bit number. What gives? If you remember, the first digit of your mantissa is a 1, and we never store it. It’s an implied bit, so only 52 mantissa bits are actually stored. That’s why we use the scientific notation. That way we know the first bit is always a 1, and we get it for free.
Let’s create some floats and doubles. To create the types in Java, we use the same pattern we did for the integer types. We give the type, either float or double, a name for our number, and possibly a value. When defining a float, you need to add the letter f to the end of your literal. That tells Java it’s a float. If you leave it off, Java thinks your literal is a double. You can add d to the end of your double literals for clarity, but the default is double.
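The same trick works for a double. Here’s a sketch that pulls apart the 64 bits with Double.doubleToLongBits. One assumption to flag: the transcript doesn’t give the bias for a double, but in IEEE 754 it is 1023. The names DoubleFields, storedExponent and mantissa are made up for this example.

```java
class DoubleFields {
    // IEEE 754 bias for the 11 bit exponent of a double (not stated in the lesson).
    static final int BIAS = 1023;

    // The 11 exponent bits, with the bias still added.
    static int storedExponent(double d) {
        return (int) ((Double.doubleToLongBits(d) >>> 52) & 0x7FF);
    }

    // The 52 mantissa bits that are actually stored.
    static long mantissa(double d) {
        return Double.doubleToLongBits(d) & 0x000FFFFFFFFFFFFFL;
    }

    public static void main(String[] args) {
        // 16.125 is 1.0000001 x 2^4 in binary scientific notation.
        System.out.println(storedExponent(16.125));         // 1027, which is 1023 + 4
        System.out.println(storedExponent(16.125) - BIAS);  // 4
        // Only the seventh bit after the binary point is set.
        System.out.println(mantissa(16.125) == (1L << 45)); // true
    }
}
```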
Luckily we don’t have to do all the bit manipulation to create our numbers. We can assign our float as a decimal literal, either as a floating point number, or as a number in scientific notation. You can use a lowercase or uppercase E. In this particular case, Java is case insensitive.
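Here’s roughly what those declarations look like. The class and variable names are picked for this sketch, and the values are just examples.

```java
class FloatLiterals {
    static final float PLAIN_FLOAT = 16.125f;    // the f suffix makes it a float
    static final double PLAIN_DOUBLE = 16.125;   // no suffix means double
    static final double MARKED_DOUBLE = 16.125d; // the d suffix is optional
    static final float SCI_LOWER = 1.6125e1f;    // scientific notation, lowercase e
    static final double SCI_UPPER = 1.6125E1;    // uppercase E means the same thing

    public static void main(String[] args) {
        System.out.println(PLAIN_FLOAT);   // 16.125
        System.out.println(PLAIN_DOUBLE);  // 16.125
        System.out.println(MARKED_DOUBLE); // 16.125
        System.out.println(SCI_LOWER);     // 16.125
        System.out.println(SCI_UPPER);     // 16.125
    }
}
```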
We can run the app, and it prints the numbers back out.
Let’s create that error. We’ll try creating a float, then assigning a number without the f at the end. We get the lossy conversion error again. Most of the time, you’ll work with doubles. There is little reason to use floats these days.
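If you want to reproduce the error at home, here’s a sketch. The line that fails is left as a comment so the file still compiles. Uncomment it and javac complains about a possible lossy conversion from double to float. The class name and the value 3.14 are just examples.

```java
class LossyConversion {
    // Fix 1: add the f suffix so the literal is a float to begin with.
    static float withSuffix() {
        float f = 3.14f;
        return f;
    }

    // Fix 2: cast the double literal down to a float explicitly.
    static float withCast() {
        float f = (float) 3.14;
        return f;
    }

    public static void main(String[] args) {
        // float broken = 3.14; // error: incompatible types: possible lossy conversion from double to float

        System.out.println(withSuffix()); // 3.14
        System.out.println(withCast());   // 3.14
    }
}
```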
You can also set the numbers to the special numbers we talked about before using the Float and Double number classes. We’ll cover the classes later, but here’s how to set a number as infinity. That can also be positive infinity. Here’s what we’d do to set it as a NaN.
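As a sketch, setting the special values looks something like this. The constants and Float.isNaN come from the standard Float and Double classes. The divide helper is only here to show that ordinary math can land on these values too, which goes a little beyond what the video shows.

```java
class SpecialValues {
    // Plain float division; dividing by zero does not throw for floats.
    static float divide(float a, float b) {
        return a / b;
    }

    public static void main(String[] args) {
        float negInf = Float.NEGATIVE_INFINITY;
        float posInf = Float.POSITIVE_INFINITY;
        float nan = Float.NaN;
        double bigNan = Double.NaN;

        System.out.println(negInf); // -Infinity
        System.out.println(posInf); // Infinity
        System.out.println(nan);    // NaN
        System.out.println(bigNan); // NaN

        System.out.println(Float.isNaN(nan));   // true
        System.out.println(divide(1.0f, 0.0f)); // Infinity
        System.out.println(divide(0.0f, 0.0f)); // NaN
    }
}
```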
You probably noticed if some of the digits are used to represent exponents, they cannot be used to represent a particular individual number like in integers. This means you have “holes” in your number line. It is not possible to represent all numbers. Some real numbers just can’t be represented in a float or double. If you can’t represent it, it gets rounded to the closest number it can represent.
This problem is even worse when you consider some simple numbers cannot be represented at all. For example take the number 1 over 3, or one third. When you try to represent this as a decimal, it will go on forever. This is what we call a repeating decimal. You can have the same problem for binary numbers.
Take 1 over 10, or one tenth. The decimal number is 0.1. Try this in binary. Take 1 and divide it by 10 in binary, which is 1010. You’ll find that you’ll get a repeating pattern in binary, and the number will never end. This is something to remember, especially if you’re trying to represent money. You can’t represent $.10 exactly.
That sounds crazy, so let’s see this in action. Pretend we’re adding ten cents with twenty cents. That should give us thirty cents. We’re going to format the print statement, because the normal print statement will round our results. We’ll print out to 16 decimal places for the float and 17 for the double. Run the program, and we get thirty cents, plus some stray digits way out there. Those digits are rounding error, and the more you add, the more that error will move to the left, toward the digits you care about.
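Here’s one way to see a hole for yourself. Math.ulp is a real method that reports the gap between a float and the next one up. The class name NumberLineHoles and the choice of 16,777,216 (that’s 2^24) are mine for this sketch.

```java
class NumberLineHoles {
    // Size of the gap between f and the next larger float.
    static float gapAt(float f) {
        return Math.ulp(f);
    }

    public static void main(String[] args) {
        float big = 16_777_216f; // 2^24

        System.out.println(gapAt(1.0f)); // a very small gap near 1
        System.out.println(gapAt(big));  // 2.0, so odd numbers up here fall in a hole

        // 16,777,217 can't be stored, so the sum rounds back to 16,777,216.
        System.out.println(big + 1f == big); // true
    }
}
```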
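If you don’t feel like doing the long division by hand, this sketch does it for you. It uses plain integer math to produce the binary digits after the point, one at a time. The class and method names are hypothetical.

```java
class RepeatingBinary {
    // Returns the first `count` binary digits after the point of numerator/denominator.
    // Assumes 0 <= numerator < denominator.
    static String binaryDigits(int numerator, int denominator, int count) {
        StringBuilder digits = new StringBuilder();
        int remainder = numerator;
        for (int i = 0; i < count; i++) {
            remainder *= 2;                        // shift one binary place
            digits.append(remainder / denominator); // next digit is 0 or 1
            remainder %= denominator;
        }
        return digits.toString();
    }

    public static void main(String[] args) {
        // One tenth: the 0011 pattern keeps repeating and never ends.
        System.out.println("0." + binaryDigits(1, 10, 24)); // 0.000110011001100110011001
        // Three quarters ends cleanly after two digits.
        System.out.println("0." + binaryDigits(3, 4, 4));   // 0.1100
    }
}
```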
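A sketch of that program might look like the following. This isn’t necessarily the exact code from the video. The names MoneyDemo, floatSum and doubleSum are made up, and Locale.ROOT is only there so the decimal point prints the same on every machine.

```java
import java.util.Locale;

class MoneyDemo {
    // Ten cents plus twenty cents as floats, printed to 16 decimal places.
    static String floatSum() {
        float total = 0.10f + 0.20f;
        return String.format(Locale.ROOT, "%.16f", total);
    }

    // The same sum as doubles, printed to 17 decimal places.
    static String doubleSum() {
        double total = 0.10 + 0.20;
        return String.format(Locale.ROOT, "%.17f", total);
    }

    public static void main(String[] args) {
        System.out.println(floatSum());  // thirty cents plus extra digits
        System.out.println(doubleSum()); // 0.30000000000000004
    }
}
```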
You should never use a primitive floating point number for money. So why use them at all if they are so error prone?
The floating point primitive types are fast. You are exchanging accuracy for speed. Since floats and doubles fit into 32 and 64 bits, they are optimized for operations on your computer. There are instructions on your CPU to handle these data types in a few clock cycles.
You often see these types used for graphics or games. Most of the math routines in Java use doubles, so if you’re not constrained on memory, always prefer the double. You get a bit more accuracy, and the performance penalty doesn’t really show on modern computers.
So that’s the float and double. There’s a much better, but slower, thing to use if you want more accuracy called the BigDecimal. We’ll cover that when we get to number classes. Up next, we’ll look at boolean and char primitives.
Tools Used
- Java
- NetBeans
Media Credits
All media created and owned by DJ Spiess unless listed below.
- No infringement intended
Music: Riding the Tundra
http://www.purple-planet.com
Licensed under Creative Commons: By Attribution 3.0
http://creativecommons.org/licenses/by/3.0/
Get the code
The source code for “Java floating-point number intricacies” can be found on Github. If you have Git installed on your system, you can clone the repository by issuing the following command:
git clone https://github.com/deege/deegeu-java-intro.git
Go to the Support > Getting the Code page for more help.
If you find any errors in the code, feel free to let me know or issue a pull request in Git.
Comments
DJ Spiess
Your personal instructor
My name is DJ Spiess and I’m a developer with a Masters degree in Computer Science working in Colorado, USA. I primarily work with Java server applications. I started programming as a kid in the 1980s, and I’ve programmed professionally since 1996. My main focuses are REST APIs, large-scale data, and mobile development. The last six years I’ve worked on large National Science Foundation projects. You can read more about my development experience on my LinkedIn account.