Unix – Convert csv to psv without removing commas within quotes

Unix is a very powerful tool to automate monotonous & manual tasks. Here is a real-life example for data engineers & software engineers. If you are not familiar with Shell scripting the code may look complex, but if you dissect as shown below it becomes easier.

Problem statement: Given the below sample input file where there can be a “,” in the content itself. How will you convert the comma separated value (i.e csv) file into a pipe separated value (i.e psv) file?

file_with_comma_on_data.csv

As you can see when there is a “,” in the field data itself it is enclosed within double quotes.

awk command to the rescue

-v option in awk is to enable you to pass variables into your awk script. A bit like passing arguments to a unix script.

OFS is a predefined output field separator variable, which is a double quote (i.e “). You can pass other variables to awk program as in -v “INPUT_DATE=$SUPPLIED_DATE”.

-F'”‘ means ” is the field separator. The line

Will be split into below 3 values using ” as the delimiter.

The line 2

will be split into as shown below.

The line 3

will be split into as shown below.

How did I get the above split?

You can either split manually on the delimiter ” or via the below code. Note presence of blanks.

‘{for (i=1; i<=NF; i+=1) print $i}' is used to loop through each field where it gets stored into the variable $i, which will be $1, $2, $3, etc, and print the variable $1, $2, etc.

NF is a predefined variable whose value is the number of fields in the current record.

‘{ for (i=1; i<=NF; i+=2) gsub(",", "|", $i) } 1!=0 {print}' loops through first, third, fifth, etc columns to replace every comma with a |. The reason for every odd numbered field is that they are the delimiter commas. The commas in the even numbered fields are left as they are because they are the commas in the content itself. What is this 1!=0 {print}?

An awk program is a series of condition-action pairs, conditions being outside of curly braces and actions being enclosed in them. A condition is considered false if it evaluates to zero or the empty string, anything else is true.

The above can be rewritten in short as:

‘{ for (i=1; i<=NF; i+=2) gsub(",", "|", $i) } 1' where The 1 at the end of your script is a condition (always true) with no action, so it executes the default action for every line, printing the line.

Output psv file

The output will be:

What if you want a space next to each pipe?

What if you want to display in columnar format?

Complete code

Outputs:

Q. How will you loop through a folder say “test3”, which has multiple csv files & output the number of columns available in each file?

A. The below code does the trick.

d -> captures the file name.

NF -> is a predefined variable whose value is the number of fields in the current record

Output:


Java developer & architect Q&As

Java developers Q&As

Top