Kickstart Your Data Skills: Perform Real Statistics Using Only the Command Line
Introduction
Many beginners believe that data analysis requires installing Python, R, or other complex tools. Surprisingly, your computer already includes a powerful set of utilities capable of handling real statistical work—the Unix command line.
These built-in tools can process large files extremely quickly, automate repetitive tasks, and run on practically any Linux or macOS machine (and on Windows through WSL). No extra software is necessary.
In this walkthrough, you’ll learn how to perform core statistical operations directly in your terminal using standard Unix commands.
A full Bash script version of this guide is available on GitHub—following along while typing the commands will help you understand each step better.
Before you begin, make sure you have:
- A Unix-based environment (Linux, macOS, or Windows with WSL).
- A terminal window open.
- No external libraries—everything here uses default system utilities.
Creating a Sample Dataset
Let’s start by generating a CSV file that represents daily website metrics. Run this block in your terminal:
cat > traffic.csv << EOF
date,visitors,page_views,bounce_rate
2024-01-01,1250,4500,45.2
2024-01-02,1180,4200,47.1
2024-01-03,1520,5800,42.3
2024-01-04,1430,5200,43.8
2024-01-05,980,3400,51.2
2024-01-06,1100,3900,48.5
2024-01-07,1680,6100,40.1
2024-01-08,1550,5600,41.9
2024-01-09,1420,5100,44.2
2024-01-10,1290,4700,46.3
EOF
This creates traffic.csv with ten days of sample data.
Exploring the Dataset
Counting Records
To see how many lines the file contains:
wc -l traffic.csv
This returns 11, meaning there are 10 data rows plus the header.
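To count only the data rows, skip the header first (tail -n +2 starts output at line 2):
tail -n +2 traffic.csv | wc -l
This prints 10.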
Previewing the File
View the first few lines with:
head -n 5 traffic.csv
You’ll see the header and the four rows beneath it.
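For a more readable preview, you can align the fields with the column utility, which ships with most Linux and macOS systems:
column -t -s',' traffic.csv | head -n 5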
Pulling Out a Single Column
To extract only the visitors column:
cut -d',' -f2 traffic.csv | tail -n +2
cut isolates column 2, and tail -n +2 removes the header.
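The same extraction can be done with awk alone, a pattern the later examples rely on (NR>1 skips the header row):
awk -F',' 'NR>1 {print $2}' traffic.csv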
Central Tendency: Mean, Median, Mode
Calculating the Mean (Average)
cut -d',' -f2 traffic.csv | tail -n +2 | \
awk '{sum+=$1; count++} END {print "Mean:", sum/count}'
Output:
Mean: 1340
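The same result can be computed without the cut and tail pipeline by letting awk read the file directly; this is an equivalent one-liner:
awk -F',' 'NR>1 {sum+=$2; count++} END {print "Mean:", sum/count}' traffic.csv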
Calculating the Median
cut -d',' -f2 traffic.csv | tail -n +2 | sort -n | \
awk '{arr[NR]=$1; count=NR} END {if(count%2==1) print "Median:", arr[(count+1)/2]; else print "Median:", (arr[count/2]+arr[count/2+1])/2}'
Output:
Median: 1355
Finding the Mode
cut -d',' -f2 traffic.csv | tail -n +2 | sort -n | uniq -c | \
sort -rn | head -n 1 | awk '{print "Mode:", $2, "(appears", $1, "times)"}'
This reports the most frequent value and how many times it appears. In this sample every visitor count occurs exactly once, so there is no meaningful mode and the command simply returns one of the tied values; with real data containing repeats, it surfaces the value with the highest count.
Understanding Spread: Variability in the Data
Maximum
awk -F',' 'NR>1 {if($2>max) max=$2} END {print "Maximum:", max}' traffic.csv
Minimum
awk -F',' 'NR==2 {min=$2} NR>2 {if($2<min) min=$2} END {print "Minimum:", min}' traffic.csv
Min and Max in One Pass
awk -F',' 'NR==2 {min=$2; max=$2} NR>2 {if($2<min) min=$2; if($2>max) max=$2} END {print "Min:", min, "Max:", max}' traffic.csv
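Output:
Min: 980 Max: 1680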
Population Standard Deviation
awk -F',' 'NR>1 {sum+=$2; sumsq+=$2*$2; count++} \
END {mean=sum/count; print "Std Dev:", sqrt((sumsq/count)-(mean*mean))}' traffic.csv
Output:
Std Dev: 207.364
Sample Standard Deviation
awk -F',' 'NR>1 {sum+=$2; sumsq+=$2*$2; count++} \
END {mean=sum/count; print "Sample Std Dev:", sqrt((sumsq-(sum*sum/count))/(count-1))}' traffic.csv
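Output:
Sample Std Dev: 218.581
As expected, the sample figure is slightly larger than the population figure because it divides by count-1 instead of count.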
Variance
awk -F',' 'NR>1 {sum+=$2; sumsq+=$2*$2; count++} \
END {mean=sum/count; var=(sumsq/count)-(mean*mean); print "Variance:", var}' traffic.csv
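Output:
Variance: 43000
This is simply the square of the population standard deviation reported above.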
Percentiles and Quartiles
Quartiles (Q1, Q2, Q3)
cut -d',' -f2 traffic.csv | tail -n +2 | sort -n | awk '
{arr[NR]=$1; count=NR}
END {
q1 = arr[int((count+1)/4)]
q2 = (count%2==1) ? arr[int((count+1)/2)] : (arr[count/2] + arr[count/2+1]) / 2
q3 = arr[int(3*(count+1)/4)]
print "Q1:", q1
print "Q2:", q2
print "Q3:", q3
}'
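From the quartiles you can also derive the interquartile range (Q3 minus Q1), a robust measure of spread; here is a quick sketch reusing the same indexing:
cut -d',' -f2 traffic.csv | tail -n +2 | sort -n | awk '
{arr[NR]=$1; count=NR}
END {
q1 = arr[int((count+1)/4)]
q3 = arr[int(3*(count+1)/4)]
print "IQR:", q3 - q1
}'
For this dataset it prints IQR: 420.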
Calculating Any Percentile
PERCENTILE=90
cut -d',' -f2 traffic.csv | tail -n +2 | sort -n | \
awk -v p=$PERCENTILE '
{arr[NR]=$1; count=NR}
END {
pos = (count+1) * p/100
idx = int(pos)
frac = pos - idx
if(idx >= count) print p "th percentile:", arr[count]
else print p "th percentile:", arr[idx] + frac * (arr[idx+1] - arr[idx])
}'
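As a sanity check, setting PERCENTILE=50 and rerunning the pipeline returns 1355, matching the median computed earlier, since the 50th percentile and the median are the same statistic.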
Analyzing Multiple Columns at Once
Averaging Visitors, Page Views, and Bounce Rate
awk -F',' '
NR>1 {
v_sum += $2
pv_sum += $3
br_sum += $4
count++
}
END {
print "Average visitors:", v_sum/count
print "Average page views:", pv_sum/count
print "Average bounce rate:", br_sum/count
}' traffic.csv
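Output:
Average visitors: 1340
Average page views: 4850
Average bounce rate: 45.06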
Correlation Between Visitors and Page Views
awk -F', *' '
NR>1 {
x[NR-1] = $2
y[NR-1] = $3
sum_x += $2
sum_y += $3
count++
}
END {
if (count < 2) exit
mean_x = sum_x / count
mean_y = sum_y / count
for (i = 1; i <= count; i++) {
dx = x[i] - mean_x
dy = y[i] - mean_y
cov += dx * dy
var_x += dx * dx
var_y += dy * dy
}
sd_x = sqrt(var_x / count)
sd_y = sqrt(var_y / count)
print "Correlation:", (cov / count) / (sd_x * sd_y)
}' traffic.csv
This computes the Pearson correlation coefficient.
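If you want to compare other column pairs (for example, visitors against bounce rate), the same logic can be wrapped in a small shell function. The sketch below is illustrative: the correlate name and its argument order are not part of the original script, and it assumes a comma-separated file with a single header row.
correlate() {
  # $1 = column number for x, $2 = column number for y, $3 = CSV file
  awk -F',' -v cx="$1" -v cy="$2" '
  NR>1 {
    x[NR-1] = $cx; y[NR-1] = $cy
    sum_x += $cx; sum_y += $cy
    count++
  }
  END {
    if (count < 2) exit
    mean_x = sum_x / count; mean_y = sum_y / count
    for (i = 1; i <= count; i++) {
      dx = x[i] - mean_x; dy = y[i] - mean_y
      cov += dx * dy; var_x += dx * dx; var_y += dy * dy
    }
    print "Correlation:", (cov / count) / (sqrt(var_x / count) * sqrt(var_y / count))
  }' "$3"
}
correlate 2 3 traffic.csv   # visitors vs page views, same result as above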
Conclusion
The Unix command line is far more capable than many new data scientists realize. With just a few built-in tools like awk, cut, and sort, you can calculate averages, variance, percentiles, correlations, and more—without relying on external libraries.
These techniques complement your Python or R workflow, offering a lightweight way to inspect, validate, and explore data.
Since every Unix-style environment includes these tools, you can practice and apply these skills anywhere.
Open your terminal and start experimenting—you’ll be surprised how powerful the command line can be.