Processing large files efficiently in Java – part 1

Q1. What are the key considerations in processing large files?
A1. Before jumping into coding, get the requirements.

#1. Processing a file involves reading it from the disk, processing it (e.g. parsing XML and transforming it), and writing the result back to the disk. It is also a trade-off in terms of what matters more to you: better I/O, better CPU usage, or better memory usage. It is important to profile your application to monitor all three.

1) Reading the data from the disk can be I/O-heavy.
2) Storing the read data in the Java heap memory to process them can be memory-heavy.
3) Parsing & transforming the data can be CPU-heavy.
4) Writing the processed data back to the disk can be I/O-heavy.

#2. The file types to process: only ASCII (text), only binary, or both.

If you need to handle splittable ASCII files like comma-delimited or tab-delimited text files, you could write a simple bash script that divides the files into smaller pieces and read those as usual (a Java sketch of the same idea follows).
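As a rough illustration, here is a minimal Java sketch of that split idea (the file names, the UTF-8 charset, and the 100,000-lines-per-part chunk size are assumptions for illustration):

    import java.io.BufferedReader;
    import java.io.BufferedWriter;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    // Sketch: split a delimited text file into smaller part files,
    // 100,000 lines per part, so each part can be processed as usual.
    public class FileSplitter {
        public static void main(String[] args) throws IOException {
            final int linesPerPart = 100_000;
            try (BufferedReader reader = Files.newBufferedReader(
                    Paths.get("large.csv"), StandardCharsets.UTF_8)) {
                int lineCount = 0, part = 0;
                BufferedWriter writer = Files.newBufferedWriter(
                        Paths.get("part-" + part + ".csv"), StandardCharsets.UTF_8);
                String line;
                while ((line = reader.readLine()) != null) {
                    if (lineCount++ == linesPerPart) {
                        writer.close(); // start a new part file
                        writer = Files.newBufferedWriter(
                                Paths.get("part-" + (++part) + ".csv"), StandardCharsets.UTF_8);
                        lineCount = 1;
                    }
                    writer.write(line);
                    writer.newLine();
                }
                writer.close();
            }
        }
    }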

But if you have a requirement to handle binary formats like PDF, the splitting approach won't work. If it is an XML file, favor streaming with a parser like StAX (Streaming API for XML), which can be used for reading & writing with good CPU & memory efficiency.
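For instance, a minimal StAX read loop might look like this (the file name and the "record" element name are assumptions for illustration):

    import java.io.FileInputStream;
    import javax.xml.stream.XMLInputFactory;
    import javax.xml.stream.XMLStreamConstants;
    import javax.xml.stream.XMLStreamReader;

    // Sketch: stream through a large XML file one event at a time,
    // so memory usage stays flat regardless of file size.
    public class StaxRead {
        public static void main(String[] args) throws Exception {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            try (FileInputStream in = new FileInputStream("large.xml")) {
                XMLStreamReader reader = factory.createXMLStreamReader(in);
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.START_ELEMENT
                            && "record".equals(reader.getLocalName())) {
                        // process one <record> element at a time here
                    }
                }
                reader.close();
            }
        }
    }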

When working with Big Data, using plain XML and JSON file formats is a common mistake, as they are not splittable. Big Data frameworks support container file formats like SequenceFile, Avro, Parquet, ORC, etc. (see "Hadoop file formats and how to choose"). These file formats are splittable & compressible, so they take advantage of distributed computing: each split chunk can be given to a separate executor to process.

#3. In order to better utilize I/O, CPU, and memory, you will have to read, parse, and write in chunks, processing regions of data incrementally, e.g. using memory-mapped files. The good thing about memory-mapped files is that the file contents do not consume Java heap or swap space, since the pages are backed by the file data on disk. But you can still get OutOfMemoryErrors for very large files.
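A minimal sketch of this incremental approach, mapping one fixed-size region at a time (the 64MB window size and the file name are assumptions):

    import java.io.RandomAccessFile;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;

    // Sketch: map a large file in fixed-size windows rather than in one
    // huge mapping, processing each region before moving to the next.
    public class MappedRegions {
        public static void main(String[] args) throws Exception {
            final long window = 64L * 1024 * 1024; // 64MB per mapped region
            try (RandomAccessFile raf = new RandomAccessFile("large.bin", "r");
                 FileChannel channel = raf.getChannel()) {
                long size = channel.size();
                for (long pos = 0; pos < size; pos += window) {
                    long len = Math.min(window, size - pos);
                    MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_ONLY, pos, len);
                    while (buffer.hasRemaining()) {
                        byte b = buffer.get(); // process each byte (or bulk-get into an array)
                    }
                }
            }
        }
    }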

#4. You can also introduce multi-threading with a pool of a finite number of threads to improve CPU and I/O efficiency at the cost of memory. For example:

“You can create a big byte buffer, run several threads that read bytes from a file into that buffer in parallel; when ready, find the first end of line, make a String object, find the next, and repeat.”

Again, it is a trade-off: profile and tune your application in terms of thread pool size, allocated heap memory, garbage collection algorithms, etc.
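A hedged sketch of that idea with a finite pool (the pool size, batch size, and the processLine placeholder are assumptions; a real implementation would tune these via profiling):

    import java.io.BufferedReader;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    // Sketch: one reader thread feeds batches of lines to a finite worker pool,
    // trading extra memory (the batches in flight) for better CPU utilization.
    public class PooledLineProcessor {
        public static void main(String[] args) throws Exception {
            ExecutorService pool = Executors.newFixedThreadPool(4);
            final int batchSize = 10_000;
            try (BufferedReader reader = Files.newBufferedReader(Paths.get("large.txt"))) {
                List<String> batch = new ArrayList<>(batchSize);
                String line;
                while ((line = reader.readLine()) != null) {
                    batch.add(line);
                    if (batch.size() == batchSize) {
                        final List<String> work = batch;
                        pool.submit(() -> work.forEach(PooledLineProcessor::processLine));
                        batch = new ArrayList<>(batchSize);
                    }
                }
                final List<String> rest = batch;
                pool.submit(() -> rest.forEach(PooledLineProcessor::processLine));
            } finally {
                pool.shutdown();
                pool.awaitTermination(1, TimeUnit.HOURS);
            }
        }

        private static void processLine(String line) {
            // CPU-bound parsing/transforming would go here
        }
    }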

Q2. What are the different ways to read a file in Java?
A2. There are many ways. Here are some examples and timings using a 5MB file, a 250MB file, and a 1GB file.

1. Reading a 5MB file line by line with the Scanner class

Total elapsed time: 271 ms. Light on memory usage, but heavy on I/O.
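The code is not shown in the post, but a typical Scanner-based version looks like this (the file name is an assumption):

    import java.io.File;
    import java.io.FileNotFoundException;
    import java.util.Scanner;

    // Read line by line with Scanner: little memory used, but slower on I/O.
    public class ScannerRead {
        public static void main(String[] args) throws FileNotFoundException {
            long start = System.currentTimeMillis();
            try (Scanner scanner = new Scanner(new File("file.txt"))) {
                while (scanner.hasNextLine()) {
                    String line = scanner.nextLine(); // process the line here
                }
            }
            System.out.println("Total elapsed time: " + (System.currentTimeMillis() - start) + " ms");
        }
    }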

2. Reading a 5MB file with Java NIO using memory-mapped files

Total elapsed time: 43 ms. Efficient on I/O. More work is required to process the buffer.
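A sketch consistent with the post's later note that this version called buffer.load() (the file name is an assumption):

    import java.io.RandomAccessFile;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;

    // Map the whole file and eagerly page it in with load(). Fast for small
    // files, but load() on a very large file risks OutOfMemoryError, as the
    // 1GB run below shows.
    public class MappedRead {
        public static void main(String[] args) throws Exception {
            long start = System.currentTimeMillis();
            try (RandomAccessFile raf = new RandomAccessFile("file.txt", "r");
                 FileChannel channel = raf.getChannel()) {
                MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
                buffer.load(); // eagerly fault the file content into memory
                // further work: scan the buffer for line breaks, decode bytes, etc.
            }
            System.out.println("Total elapsed time: " + (System.currentTimeMillis() - start) + " ms");
        }
    }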

3. Reading a 5MB file line by line with Java 8 Stream

Total elapsed time: 160 ms.
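A typical Java 8 Stream version (the file name is an assumption):

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.stream.Stream;

    // Java 8: lazily stream lines; the file is not loaded into memory up front.
    public class StreamRead {
        public static void main(String[] args) throws IOException {
            long start = System.currentTimeMillis();
            try (Stream<String> lines = Files.lines(Paths.get("file.txt"))) {
                lines.forEach(line -> { /* process the line here */ });
            }
            System.out.println("Total elapsed time: " + (System.currentTimeMillis() - start) + " ms");
        }
    }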

4. Reading a 5MB file with Java 7 Files & Paths classes

Total elapsed time: 150 ms.
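A typical Java 7 version, which reads the whole file into a List (the file name is an assumption):

    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.List;

    // Java 7: Files.readAllLines reads the entire file into a List in one
    // call, so memory usage grows with file size.
    public class FilesRead {
        public static void main(String[] args) throws IOException {
            long start = System.currentTimeMillis();
            List<String> lines = Files.readAllLines(Paths.get("file.txt"), StandardCharsets.UTF_8);
            for (String line : lines) { /* process the line here */ }
            System.out.println("Total elapsed time: " + (System.currentTimeMillis() - start) + " ms");
        }
    }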

Reading a 250MB file

1. Scanner approach: Total elapsed time: 7062 ms

2. Mapped Byte Buffer: Total elapsed time: 1220 ms

3. Java 8 Stream: Total elapsed time: 1024 ms

4. Java 7 Files: Total elapsed time: 3400 ms

Reading a 1GB file

1. Scanner approach: Total elapsed time: 15627 ms

2. Mapped Byte Buffer: Exception in thread "main" java.lang.OutOfMemoryError: Java heap space

3. Java 8 Stream: Total elapsed time: 3124 ms

4. Java 7 Files: Total elapsed time: 13657 ms

The approach #2 OutOfMemoryError was due to loading the whole file into memory with "buffer.load()". This can be fixed: let's revise the code to read the mapped buffer in chunks with "buffer.get()".
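A hedged sketch of the revised approach: keep the mapping, drop load(), and pull bytes through get() in fixed-size chunks (the 8KB chunk size and the file name are assumptions):

    import java.io.RandomAccessFile;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;

    // Revised sketch: no buffer.load(); read the mapping in fixed-size chunks
    // via buffer.get(byte[]) so only the pages currently touched are resident.
    public class MappedChunkedRead {
        public static void main(String[] args) throws Exception {
            long start = System.currentTimeMillis();
            try (RandomAccessFile raf = new RandomAccessFile("file.txt", "r");
                 FileChannel channel = raf.getChannel()) {
                MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
                byte[] chunk = new byte[8 * 1024];
                while (buffer.hasRemaining()) {
                    int len = Math.min(chunk.length, buffer.remaining());
                    buffer.get(chunk, 0, len); // process chunk[0..len) here
                }
            }
            System.out.println("Total elapsed time: " + (System.currentTimeMillis() - start) + " ms");
        }
    }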

Total elapsed time: 460 ms.

Conclusion – MappedByteBuffer wins for file sizes up to 1 GB

Java NIO's MappedByteBuffer performs the best for large files, followed by the Java 8 Stream approach. Monitor your application to see if it is more I/O bound, memory bound, or CPU bound.

When reading a file larger than 1 GB into memory

You can get "OutOfMemoryError"s. You need to stream your reading & writing to prevent them. For example, using:

#1. Apache Commons IO library's LineIterator (see the sketch after this list).

#2. java.util.Scanner streaming, which can also read from "System.in".

#3. reader.lines() from Java 8 onwards.

#4. JAXB with StAX for streaming XML:

1) JAXB with StAX tutorial, step by step, for unmarshalling.

2) JAXB with StAX tutorial, step by step, for marshalling.
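As an illustration of #1 and #3, two hedged sketches of streaming a file that is too large for memory (the file name is an assumption, and #1 requires the Apache Commons IO dependency):

    import java.io.BufferedReader;
    import java.io.File;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import org.apache.commons.io.FileUtils;
    import org.apache.commons.io.LineIterator;

    public class HugeFileStreaming {
        public static void main(String[] args) throws Exception {
            // #1: Apache Commons IO LineIterator — one line at a time, memory stays flat
            LineIterator it = FileUtils.lineIterator(new File("huge.txt"), "UTF-8");
            try {
                while (it.hasNext()) {
                    String line = it.nextLine(); // process the line here
                }
            } finally {
                LineIterator.closeQuietly(it);
            }

            // #3: Java 8 reader.lines() — also a lazy stream over the file
            try (BufferedReader reader = Files.newBufferedReader(Paths.get("huge.txt"))) {
                reader.lines().forEach(line -> { /* process the line here */ });
            }
        }
    }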

Also, you need to understand Java's String & array size limitations, which can themselves lead to OutOfMemoryErrors.

What file sizes count as Big Data?

1) Small data is < 10 GB. It fits in a single machine's memory, but you may still process it by streaming to conserve memory.

2) Medium data is 10 GB to 1 TB. It fits on a single machine's disk, but you won't be able to read all the contents into memory, so process it by splitting or streaming.

3) Big data is > 1 TB. It is stored across multiple machines and processed in a distributed fashion, e.g. by running a MapReduce or Spark job.

What is next?

The next post will introduce file parsing and multi-threading to improve efficiency.
