Processing large files efficiently in Java – part 1

Q1. What are the key considerations in processing large files?
A1. Before jumping into coding, get the requirements.

#1. Processing a file involves reading from the disk, processing the data, and writing the results back to the disk. It is a trade-off: decide which matters most to you among better I/O throughput, lower CPU usage, and lower memory usage. Profile the application to monitor CPU usage, memory usage, and I/O efficiency.

1) Reading the data from the disk can be I/O-heavy.
2) Storing the read data in the Java heap can be memory-heavy.
3) Parsing the data can be CPU-heavy.
4) Writing the processed data back to the disk can be I/O-heavy.

#2. File types to process: ASCII only, binary only, or both. If you need to handle only ASCII files, you could write a simple bash script that divides the files into smaller pieces, and then read the pieces as usual.

But if you have a requirement to handle binary formats as well, then the file-splitting approach won’t work, because a binary file has no line boundaries at which it can be split safely. If it is an XML file, then favor a streaming XML parser like StAX (Streaming API for XML).
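As an illustration of the StAX suggestion, here is a minimal sketch that counts elements by pulling one parse event at a time, so the document is never fully loaded into memory. The class name StaxElementCounter, the method countElements, and the sample XML are made up for this example.

```java
import java.io.Reader;
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical example class, not part of the original post.
class StaxElementCounter {

    // Counts start tags whose local name equals elementName.
    // Only the current event is held in memory at any time.
    static int countElements(Reader xml, String elementName) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        XMLStreamReader reader = factory.createXMLStreamReader(xml);
        int count = 0;
        try {
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT
                        && elementName.equals(reader.getLocalName())) {
                    count++;
                }
            }
        } finally {
            reader.close();
        }
        return count;
    }

    public static void main(String[] args) throws XMLStreamException {
        // Sample document invented for the demonstration.
        String xml = "<orders><order id=\"1\"/><order id=\"2\"/></orders>";
        System.out.println(countElements(new StringReader(xml), "order"));
    }
}
```

For a real file you would pass a buffered file reader or input stream instead of the StringReader used here.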

#3. In order to better utilize I/O, CPU, and memory, you will have to read, parse, and write in chunks. One way is to process regions of data incrementally using memory-mapped files. The good thing about memory-mapped files is that the mapped pages are backed by the file data on disk, so they consume neither Java heap nor paging (swap) space, although they do occupy virtual address space. But you can still get OutOfMemoryError for very large files.
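The idea of processing regions incrementally can be sketched as follows. This is only an illustration under assumptions: the class name MappedWindowReader, the method countNewlines, and the window size are invented here, and counting newline bytes stands in for whatever processing you actually need.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical example class, not part of the original post.
class MappedWindowReader {

    // Counts '\n' bytes by mapping the file one window at a time,
    // so only windowSize bytes are mapped by the current step.
    static long countNewlines(Path file, int windowSize) throws IOException {
        if (windowSize <= 0) {
            throw new IllegalArgumentException("windowSize must be positive");
        }
        long newlines = 0;
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            long fileSize = channel.size();
            for (long position = 0; position < fileSize; position += windowSize) {
                long length = Math.min(windowSize, fileSize - position);
                MappedByteBuffer window =
                        channel.map(FileChannel.MapMode.READ_ONLY, position, length);
                while (window.hasRemaining()) {
                    if (window.get() == '\n') {
                        newlines++;
                    }
                }
            }
        }
        return newlines;
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("mapped-window", ".txt");
        file.toFile().deleteOnExit();
        Files.write(file, "a\nbb\nccc\n".getBytes(StandardCharsets.US_ASCII));
        // A tiny window is used only to exercise the loop; prints 3.
        System.out.println(countNewlines(file, 4));
    }
}
```

In practice the window would be many megabytes. A single mapping cannot exceed Integer.MAX_VALUE bytes, which is another reason to map files larger than about 2 GB in windows.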

#4. You can also introduce multi-threading with a fixed-size thread pool to improve CPU and I/O efficiency at the cost of memory. For example:

“You can create a big byte buffer, run several threads that read bytes from a file into that buffer in parallel, when ready find first end of line, make a string object, find next, and repeat these sequences.”

Again, it is a trade-off, so profile and tune your application in terms of thread pool size, allocated heap memory, garbage collection algorithm, etc.
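A rough sketch of the multi-threaded idea, under stated assumptions: it uses a fixed-size pool where each task maps its own region of the file, rather than the single shared byte buffer described in the quote. Counting newline bytes is used because it does not need regions to end on a line boundary. The names ParallelLineCounter and countNewlines, and the sizes in main, are hypothetical.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical example class, not part of the original post.
class ParallelLineCounter {

    // Splits the file into regions and counts '\n' bytes in each region on a pool thread.
    static long countNewlines(Path file, int threads, int regionSize)
            throws IOException, InterruptedException, ExecutionException {
        if (regionSize <= 0) {
            throw new IllegalArgumentException("regionSize must be positive");
        }
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            long fileSize = channel.size();
            List<Future<Long>> results = new ArrayList<>();
            for (long position = 0; position < fileSize; position += regionSize) {
                final long start = position;
                final long length = Math.min(regionSize, fileSize - start);
                Callable<Long> task = () -> {
                    // FileChannel is safe for use by multiple concurrent threads.
                    MappedByteBuffer region =
                            channel.map(FileChannel.MapMode.READ_ONLY, start, length);
                    long count = 0;
                    while (region.hasRemaining()) {
                        if (region.get() == '\n') {
                            count++;
                        }
                    }
                    return count;
                };
                results.add(pool.submit(task));
            }
            long total = 0;
            for (Future<Long> result : results) {
                total += result.get(); // wait for every task before the channel closes
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        Path file = Files.createTempFile("parallel-count", ".txt");
        file.toFile().deleteOnExit();
        Files.write(file, "one\ntwo\nthree\nfour\n".getBytes(StandardCharsets.US_ASCII));
        // Small regions so that several tasks are created; prints 4.
        System.out.println(countNewlines(file, 4, 5));
    }
}
```

Whether this beats a single thread depends on the disk and the work done per byte, so measure with different pool and region sizes before settling on one.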

Q2. What are the different ways to read a file in Java?
A2. There are many ways. Here are some examples and timings using a 5MB file, a 250MB file, and a 1GB file.

1. Reading a 5MB file line by line with the Scanner class

Total elapsed time: 271 ms. Light on memory usage, but heavy on I/O.
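The measured listing is not reproduced here. A minimal sketch of the Scanner approach, assuming a UTF-8 text file, could look like this (the class name ScannerLineReader and the method countLines are made up for the example):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Scanner;

// Hypothetical example class, not part of the original post.
class ScannerLineReader {

    // Reads the file one line at a time; only the current line is kept in memory.
    static long countLines(Path file) throws IOException {
        long lines = 0;
        try (Scanner scanner = new Scanner(file, StandardCharsets.UTF_8.name())) {
            while (scanner.hasNextLine()) {
                scanner.nextLine();
                lines++;
            }
            // Scanner swallows IOExceptions while reading; surface the last one, if any.
            IOException failure = scanner.ioException();
            if (failure != null) {
                throw failure;
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java ScannerLineReader <file>");
            return;
        }
        long start = System.nanoTime();
        long lines = countLines(Paths.get(args[0]));
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println(lines + " lines, total elapsed time: " + elapsedMs + " ms");
    }
}
```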

2. Reading a 5MB file with Java NIO using memory-mapped files

Total elapsed time: 43 ms. Efficient on I/O. More work is required to process the buffer.
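The measured listing is not reproduced here. As a sketch of what the extra work on the buffer can look like, the following maps a whole file in one call and scans the bytes for line endings. The names MappedFileLineCounter and countLines are hypothetical, and the code assumes the file fits in a single mapping.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

// Hypothetical example class, not part of the original post.
class MappedFileLineCounter {

    // Maps the whole file in one call and scans it for line endings.
    // A single mapping cannot exceed Integer.MAX_VALUE bytes (about 2 GB);
    // FileChannel.map throws IllegalArgumentException for a larger size.
    static long countLines(Path file) throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            long size = channel.size();
            if (size == 0) {
                return 0;
            }
            MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_ONLY, 0, size);
            long lines = 0;
            for (int i = 0; i < buffer.limit(); i++) {
                if (buffer.get(i) == '\n') {
                    lines++;
                }
            }
            // Count a final line that has no trailing newline.
            if (buffer.get(buffer.limit() - 1) != '\n') {
                lines++;
            }
            return lines;
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java MappedFileLineCounter <file>");
            return;
        }
        long start = System.nanoTime();
        long lines = countLines(Paths.get(args[0]));
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println(lines + " lines, total elapsed time: " + elapsedMs + " ms");
    }
}
```

Unlike Scanner, nothing here decodes bytes into String objects, which is part of why the mapped approach is light on CPU and heap; turning the bytes into text is left to the caller.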

3. Reading a 5MB file line by line with Java 8 Stream

Total elapsed time: 160 ms.
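The measured listing is not reproduced here. A minimal sketch of the Java 8 approach with Files.lines, assuming a UTF-8 text file, might be (StreamLineReader and countNonBlankLines are names invented for this example):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

// Hypothetical example class, not part of the original post.
class StreamLineReader {

    // Files.lines reads lazily, so lines are pulled from the file as the stream is consumed.
    // The stream holds an open file handle and must be closed, hence try-with-resources.
    static long countNonBlankLines(Path file) throws IOException {
        try (Stream<String> lines = Files.lines(file, StandardCharsets.UTF_8)) {
            return lines.filter(line -> !line.trim().isEmpty()).count();
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java StreamLineReader <file>");
            return;
        }
        long start = System.nanoTime();
        long lines = countNonBlankLines(Paths.get(args[0]));
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println(lines + " non-blank lines, total elapsed time: " + elapsedMs + " ms");
    }
}
```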

4. Reading a 5MB file with Java 7 Files & Paths classes

Total elapsed time: 150 ms.
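The measured listing is not reproduced here, so it is not known which Files method it used. One plausible sketch with the Java 7 Files and Paths classes reads through Files.newBufferedReader; that choice, and the names FilesBufferedLineReader and countLines, are assumptions of this example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical example class, not part of the original post.
class FilesBufferedLineReader {

    // Reads line by line through a BufferedReader obtained from Files (Java 7).
    static long countLines(Path file) throws IOException {
        long lines = 0;
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            while (reader.readLine() != null) {
                lines++;
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java FilesBufferedLineReader <file>");
            return;
        }
        Path file = Paths.get(args[0]);
        long start = System.nanoTime();
        long lines = countLines(file);
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println(lines + " lines, total elapsed time: " + elapsedMs + " ms");
    }
}
```

Files.readAllLines is a shorter alternative, but it returns every line in one List and therefore holds the whole file on the heap.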

Reading a 250.0MB file

1. Scanner approach: Total elapsed time: 7062 ms

2. Mapped Byte Buffer: Total elapsed time: 1220 ms

3. Java 8 Stream: Total elapsed time: 1024 ms

4. Java 7 Files: Total elapsed time: 3400 ms

Reading a 1.0GB file

1. Scanner approach: Total elapsed time: 15627 ms

2. Mapped Byte Buffer: Exception in thread "main" java.lang.OutOfMemoryError: Java heap space

3. Java 8 Stream: Total elapsed time: 3124 ms

4. Java 7 Files: Total elapsed time: 13657 ms

The OutOfMemoryError in approach #2 was due to loading the whole file into memory with “buffer.load();”. This can be fixed. Let’s revise the code to read with “buffer.get()” instead.

Total elapsed time: 460 ms.
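The revised listing is not reproduced here. One way to read a mapped buffer with get() and without load() is to drain it through a small reusable array, as in the sketch below. The names MappedBulkReader and countNewlines and the chunk size are assumptions, and this may differ from the code that produced the timing above.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

// Hypothetical example class, not part of the original post.
class MappedBulkReader {

    // Drains the mapped buffer with bulk get() calls into one reusable array,
    // so heap usage stays at chunkSize bytes regardless of the file size.
    static long countNewlines(Path file, int chunkSize) throws IOException {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            // A single mapping is limited to Integer.MAX_VALUE bytes (about 2 GB).
            MappedByteBuffer buffer =
                    channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            byte[] chunk = new byte[chunkSize];
            long newlines = 0;
            while (buffer.hasRemaining()) {
                int n = Math.min(chunk.length, buffer.remaining());
                buffer.get(chunk, 0, n); // no buffer.load() call
                for (int i = 0; i < n; i++) {
                    if (chunk[i] == '\n') {
                        newlines++;
                    }
                }
            }
            return newlines;
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java MappedBulkReader <file>");
            return;
        }
        long start = System.nanoTime();
        long newlines = countNewlines(Paths.get(args[0]), 8192);
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println(newlines + " newlines, total elapsed time: " + elapsedMs + " ms");
    }
}
```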

Conclusion – MappedByteBuffer wins

Java NIO MappedByteBuffer performs the best for large files, followed by the Java 8 Stream approach. Monitor your application to see if it is more I/O bound, memory bound, or CPU bound. The next post will introduce file parsing and multi-threading to improve efficiency.

When reading a file larger than 1.0 GB into memory, you can get “OutOfMemoryError”s.

You need to stream your reading and writing to prevent “OutOfMemoryError”s. You also need to understand the “String & Array limitations”.
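Streaming both the read and the write side can be sketched as below: each line is read, transformed, and written before the next one is read, so memory use does not grow with the file size. The upper-casing step is only a placeholder for real processing, and the names StreamingTransform and copyUpperCased are hypothetical.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Locale;

// Hypothetical example class, not part of the original post.
class StreamingTransform {

    // Copies source to target one line at a time, upper-casing each line on the way.
    // Returns the number of lines written.
    static long copyUpperCased(Path source, Path target) throws IOException {
        long lines = 0;
        try (BufferedReader reader = Files.newBufferedReader(source, StandardCharsets.UTF_8);
             BufferedWriter writer = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                writer.write(line.toUpperCase(Locale.ROOT));
                writer.newLine();
                lines++;
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: java StreamingTransform <source> <target>");
            return;
        }
        long lines = copyUpperCased(Paths.get(args[0]), Paths.get(args[1]));
        System.out.println(lines + " lines written");
    }
}
```

A single extremely long line would still be held in memory as one String, which is where the String and array limits come into play.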

1) Java String & Array limitations and OutOfMemoryError.

2) JAXB with StAX Tutorial step by step for unmarshalling.

3) JAXB with StAX Tutorial step by step for marshalling.

Arulkumaran Kumaraswamipillai
Mechanical Engineering to Java freelancer since 2003. Published Java/JEE books via Amazon.com in 2005, and sold 35K+ copies. Books are outdated and replaced with this online Java training.

Posted in IO, Memory Management, Performance