Unix – combining two or more files & output only a subset of columns

Unix is a very powerful tool to automate monotonous & manual tasks. Here is a real-life example for data engineers & software engineers. If you are not familiar with Shell scripting the code may look complex, but if you dissect as shown below it becomes easier.

What does the below Unix code do?

Unix combine data

Unix combine data

Step 1: Unzip only CSV files

As depicted above, the input files are zip files containing the actual data file as in “my_file_2021_08_01.csv” & a transmission excel file as in “my_trx_2021_08_01.xlsx”. When unzipping, “-j” to not create any directory and the “.xlsx” files are excluded with the “-x *.xlsx” within the for loop as shown below. The resulting .csv files are stored in a sub directory “./my_dir” with the “-d” option.

Step 2: Combine CSV files into a single file

NR – gives the total number of records processed.
FNR – gives the total number of records for each input file.

In awk, FNR refers to the record number (typically the line number) in the current file and NR refers to the total record number across all files. The operator == is a comparison operator, which returns true when the two surrounding operands are equal. So,

(NR == 1) || (FNR > 1) means copy the first row of header only once from the first file, and then the data. The header is ignored in the subsequent files.

Step 3: Pick a few columns

NF – to limit the output to the actual number of fields for the record.
FPAT – to handle an embedded comma in a quoted field like “TV, Radio Campaign”.

FS – is the field separator.
OFS – is output field separator

Instead of “FS” “FPAT” is used with regular expression to break the fields. This is because there could be “,” in the content as shown below with “TV, Radio Campaign”.

The regex explanation:

The output of the awk command:

The unified file can be used via Excel with pivot tables for further analysis or ingested into a database table for further queries via SQL.


300+ Java & Big Data Interview FAQs

16+ Java Key Areas Interview Q&As

800+ Java Interview Q&As

300+ Java & Big Data Tutorials

Top