♦ Processing large files efficiently in Java – part 1

Q1. What are the key considerations in processing large files?
A1. Before jumping into coding, get the requirements.

#1. Processing a file involves reading it from disk, processing it (e.g. parsing XML and transforming it), and writing the result back to disk. There is also a trade-off in terms of what matters most to you: better I/O, better CPU usage, or better memory usage. It is important to profile your application to monitor CPU usage, memory usage, and I/O efficiency.

1) Reading the data from the disk can be I/O-heavy.
2) Storing the read data in the Java heap memory to process them can be memory-heavy.
3) Parsing & transforming the data can be CPU-heavy.
4) Writing the processed data back to the disk can be I/O-heavy.

CPU bound and I/O bound are two opposites. CPU bound means the program is bottlenecked by the CPU. I/O (i.e. input/output) bound means the program is bottlenecked by reading from or writing to a disk or network.

#2. File types to process. Only ASCII, only binary, or both ASCII and binary.

If you need to handle splittable ASCII files like comma-delimited or tab-delimited text files, you could write a simple bash script that divides the files into smaller pieces and read them as usual.

But if you have a requirement to handle binary formats like PDFs, then the file-splitting approach won't work. If it is an XML file, favor streaming with a parser like StAX (Streaming API for XML). StAX can be used for both reading and writing with good CPU and memory efficiency.
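As a minimal sketch of StAX's cursor API (the XML content, element names, and class name below are made up for illustration, not from the original article):

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class StaxDemo {
    // Stream through the XML one event at a time; the whole document
    // is never held in memory, unlike a DOM parse.
    public static int countElements(String xml, String name) throws Exception {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
        int count = 0;
        while (reader.hasNext()) {
            if (reader.next() == XMLStreamConstants.START_ELEMENT
                    && reader.getLocalName().equals(name)) {
                count++;
            }
        }
        reader.close();
        return count;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<orders><order id=\"1\"/><order id=\"2\"/></orders>";
        System.out.println(countElements(xml, "order")); // prints 2
    }
}
```

For a real large file you would pass a `FileInputStream` to `createXMLStreamReader` instead of a `StringReader`; the event loop stays the same.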

When working with big data, using XML and JSON file formats is a common mistake, as they are not splittable. Big data platforms support container file formats like Sequence Files, Avro, Parquet, ORC, etc. (see guides on Hadoop file formats and how to choose one). These file formats are splittable and compressible to take advantage of distributed computing: you can give each split chunk to an executor to process.

#3. In order to better utilize I/O, CPU, and memory, you will have to read, parse, and write in chunks. You can process regions of data incrementally using memory-mapped files. The good thing about memory-mapped files is that their contents live outside the Java heap and are paged in on demand, since they are backed by the file data on disk. But you can still get OutOfMemoryError for very large files, for example when the address space available for mappings is exhausted. The spring-batch framework allows you to read, process, and write data in chunks. If your processing requires talking to many systems via different protocols like FTP, HTTP, etc., then make use of Spring Integration together with spring-batch.
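A sketch of the region-by-region idea with memory-mapped files (the method and file names are mine; the byte-summing is a stand-in for real processing):

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MappedRegions {
    // Map fixed-size regions of the file one at a time instead of
    // mapping (or reading) the whole file at once.
    public static long sumBytes(Path file, int regionSize) throws IOException {
        long total = 0;
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            long size = channel.size();
            for (long pos = 0; pos < size; pos += regionSize) {
                long len = Math.min(regionSize, size - pos);
                MappedByteBuffer region = channel.map(FileChannel.MapMode.READ_ONLY, pos, len);
                while (region.hasRemaining()) {
                    total += region.get() & 0xFF; // process the region incrementally
                }
            }
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("demo", ".bin");
        Files.write(tmp, new byte[] {1, 2, 3, 4, 5});
        System.out.println(sumBytes(tmp, 2)); // prints 15
        Files.delete(tmp);
    }
}
```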

Apache Spark uses RDDs (i.e. Resilient Distributed Datasets). RDDs are split into partitions that are processed and written in parallel. These partitions are logical chunks of data made up of records. Within a partition, data is processed sequentially. You can control the number of partitions of an RDD using the repartition or coalesce transformations.

#4. You can also introduce multi-threading with a pool of a finite number of threads to improve CPU and I/O efficiency at the cost of memory. Measure the performance of a single-threaded job to see if it meets your needs before adding the complexity of multi-threading.
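A hedged sketch of the bounded-pool idea using ExecutorService (the chunks and the character-counting task are placeholders I made up for real chunked parsing work):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ChunkPool {
    // Submit each chunk of lines to a pool with a finite number of threads;
    // the pool size caps memory and CPU use.
    public static long totalChars(List<List<String>> chunks, int threads)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Long>> results = new ArrayList<>();
            for (List<String> chunk : chunks) {
                results.add(pool.submit(() ->
                        chunk.stream().mapToLong(String::length).sum()));
            }
            long total = 0;
            for (Future<Long> f : results) total += f.get(); // gather results
            return total;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        List<List<String>> chunks = List.of(List.of("ab", "cde"), List.of("f"));
        System.out.println(totalChars(chunks, 2)); // prints 6
    }
}
```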

Example 1:

"You can create a big byte buffer, run several threads that read bytes from a file into that buffer in parallel; when ready, find the first end of line, make a String object, find the next, and repeat."

Again, it is a trade-off, so profile and tune your application in terms of thread pool size, allocated heap memory, garbage collection algorithm, etc.

Example 2:

Spring-batch allows you to write multi-threaded steps. Spring also supports "remote partitioning" and "remote chunking".

Remote partitioning is a master/slave step configuration that allows partitions of data to be processed in parallel. For example, if you were processing a database table, partition 1 might be account ids 0-100, partition 2 account ids 101-200, and so on.

Remote chunking is also a master/slave configuration, but the data is read by the master and sent over the wire to the slave for processing.

Q2. What are the different ways to read a small data file in Java?
A2. There are many ways. Here are some examples and timings using a 5 MB file, a 250 MB file, and a 1 GB file.

1. Read a 5MB file line by line with a scanner class
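The original benchmark code isn't shown here, but a minimal sketch of the Scanner approach might look like this (the temp file and the line count are my own stand-ins):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Scanner;

public class ScannerRead {
    // Reads the file one line at a time; only the current line is in memory.
    public static long countLines(Path file) throws IOException {
        long lines = 0;
        try (Scanner scanner = new Scanner(file)) {
            while (scanner.hasNextLine()) {
                scanner.nextLine(); // process the line here
                lines++;
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("demo", ".txt");
        Files.write(tmp, List.of("a", "b", "c"));
        System.out.println(countLines(tmp)); // prints 3
        Files.delete(tmp);
    }
}
```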

Total elapsed time: 271 ms. Light on memory usage, but heavy on I/O.

2. Reading a 5MB file with Java NIO using memory mapped files
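A sketch of the memory-mapped approach, assuming the whole file is mapped and pre-loaded at once (class name and the newline-counting work are mine):

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MappedRead {
    public static long countNewlines(Path file) throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            MappedByteBuffer buffer =
                    channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            buffer.load(); // pre-fault the whole mapping into physical memory
            long newlines = 0;
            while (buffer.hasRemaining()) {
                if (buffer.get() == '\n') newlines++; // work on raw bytes
            }
            return newlines;
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("demo", ".txt");
        Files.write(tmp, "line1\nline2\n".getBytes());
        System.out.println(countNewlines(tmp)); // prints 2
        Files.delete(tmp);
    }
}
```

Note that the buffer gives you raw bytes, which is why more work is needed to turn it into lines or strings.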

Total elapsed time: 43 ms. Efficient on I/O. More work is required to process the buffer.

3. Reading a 5MB file line by line with Java 8 Stream
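A sketch of the Java 8 stream approach (the counting is a placeholder for per-line processing):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Stream;

public class StreamRead {
    // Files.lines is lazy: lines are read from disk as the stream is consumed.
    public static long countLines(Path file) throws IOException {
        try (Stream<String> lines = Files.lines(file)) {
            return lines.count();
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("demo", ".txt");
        Files.write(tmp, List.of("a", "b", "c"));
        System.out.println(countLines(tmp)); // prints 3
        Files.delete(tmp);
    }
}
```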

Total elapsed time: 160 ms.

4. Reading a 5MB file with Java 7 Files & Paths classes
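A sketch of the Java 7 approach with the Files and Paths classes (note that unlike the streaming options, this loads the whole file into memory at once):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class ReadAll {
    // readAllLines reads the entire file into a List in one go,
    // so it only suits files that comfortably fit in the heap.
    public static List<String> read(Path file) throws IOException {
        return Files.readAllLines(file, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("demo", ".txt");
        Files.write(tmp, List.of("a", "b"));
        System.out.println(read(tmp).size()); // prints 2
        Files.delete(tmp);
    }
}
```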

Total elapsed time: 150 ms.

Reading a 250 MB file

1. Scanner approach: Total elapsed time: 7062 ms

2. Mapped byte buffer: Total elapsed time: 1220 ms

3. Java 8 Stream: Total elapsed time: 1024 ms

4. Java 7 Files: Total elapsed time: 3400 ms

Reading a 1 GB file

1. Scanner approach: Total elapsed time: 15627 ms

2. Mapped byte buffer: Exception in thread "main" java.lang.OutOfMemoryError: Java heap space

3. Java 8 Stream: Total elapsed time: 3124 ms

4. Java 7 Files: Total elapsed time: 13657 ms

The approach #2 OutOfMemoryError was due to loading the whole file into memory with "buffer.load()". This can be fixed: let's revise the code to use "buffer.get()".
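A sketch of the revised approach: instead of pre-loading the whole mapping with buffer.load(), copy it out in small chunks with the bulk buffer.get() (chunk size and class name are my own choices):

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MappedChunkRead {
    public static long byteCount(Path file) throws IOException {
        long count = 0;
        byte[] chunk = new byte[8192]; // small reusable buffer instead of load()
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            MappedByteBuffer buffer =
                    channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            while (buffer.hasRemaining()) {
                int n = Math.min(chunk.length, buffer.remaining());
                buffer.get(chunk, 0, n); // copy one chunk out of the mapping
                count += n;              // process chunk[0..n) here
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("demo", ".bin");
        Files.write(tmp, new byte[] {1, 2, 3, 4, 5});
        System.out.println(byteCount(tmp)); // prints 5
        Files.delete(tmp);
    }
}
```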

Total elapsed time: 460 ms.

Conclusion – MappedByteBuffer wins for file sizes up to 1 GB

Java NIO's MappedByteBuffer performs best for large files, followed by the Java 8 stream. Monitor your application to see whether it is more I/O bound, memory bound, or CPU bound.

When reading a file larger than 1 GB into memory

You can get OutOfMemoryErrors. You need to stream your reading and writing to prevent them. For example, using:

#1. The Apache Commons IO library's LineIterator

#2. java.util.Scanner streaming, e.g. to read from "System.in"

#3. Java 8 onward has reader.lines()
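A sketch of reader.lines() streaming (the in-memory StringReader below stands in for a real file or System.in):

```java
import java.io.BufferedReader;
import java.io.StringReader;
import java.util.stream.Stream;

public class ReaderLines {
    // reader.lines() yields a lazy stream: each line is read only as
    // the stream consumes it, so the whole input is never held in memory.
    public static long countNonEmpty(BufferedReader reader) {
        try (Stream<String> lines = reader.lines()) {
            return lines.filter(l -> !l.isEmpty()).count();
        }
    }

    public static void main(String[] args) {
        BufferedReader in = new BufferedReader(new StringReader("a\n\nb\n"));
        System.out.println(countNonEmpty(in)); // prints 2
    }
}
```

To stream standard input the same way, wrap it as `new BufferedReader(new InputStreamReader(System.in))`.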

#4. JAXB with StAX for streaming large XMLs

1) JAXB with StAX tutorial, step by step, for unmarshalling.

2) JAXB with StAX tutorial, step by step, for marshalling.

Q3. What are the different data sizes, and what technologies can be used to process them?
A3. In general, data sizes can be classified as shown below.

1) Small size data is < 10 GB, in multiple files. It fits in a single machine's memory; process it by streaming to conserve memory. Java's file processing APIs, the Apache Commons file APIs, the Spring Batch framework, or the Java EE 7 batch processing framework can be used.

2) Medium size data is 10 GB to 1 TB, in multiple files. It fits on a single machine's disk. Process it by splitting or streaming, as you won't be able to read all the contents into memory. The Spring Batch framework or the Java EE 7 batch processing framework can be used.

3) Big data is > 1 TB, in multiple files. It is stored across multiple machines and processed in a distributed fashion, e.g. by running a MapReduce or Spark job.

Processing medium size data? Look at Spring-batch, e.g. the Spring batch industrial strength tutorial - part 1.

Processing big data? Look at Apache Spark for parallel processing, e.g. the Apache Spark interview questions & answers and the Apache Spark tutorials.

Reading a file of ~2 GB into memory and its limitations

String & array limitations: see Java String array limitations and OutOfMemoryError. A Java array can hold at most Integer.MAX_VALUE elements, so a ~2 GB file cannot fit in a single String or byte[].

What is next?

The next post will introduce file parsing and multi-threading to improve efficiency.


Arulkumaran Kumaraswamipillai


Posted in IO, Memory Management, Performance
