
Kickstart Your Data Skills: Perform Real Statistics Using Only the Command Line

December 10, 2025



Introduction

Many beginners believe that data analysis requires installing Python, R, or other complex tools. Surprisingly, your computer already includes a powerful set of utilities capable of handling real statistical work—the Unix command line.

These built-in tools stream data line by line, so they can process files far larger than memory, automate repetitive tasks, and run on practically any Linux or macOS machine (and on Windows through WSL). No extra software is necessary.

In this walkthrough, you’ll learn how to perform core statistical operations directly in your terminal using standard Unix commands.

A full Bash script version of this guide is available on GitHub—following along while typing the commands will help you understand each step better.

Before you begin, make sure you have:

  • A Unix-based environment (Linux, macOS, or Windows with WSL).
  • A terminal window open.
  • No external libraries—everything here uses default system utilities.


Creating a Sample Dataset

Let’s start by generating a CSV file that represents daily website metrics. Run this block in your terminal:

cat > traffic.csv << EOF
date,visitors,page_views,bounce_rate
2024-01-01,1250,4500,45.2
2024-01-02,1180,4200,47.1
2024-01-03,1520,5800,42.3
2024-01-04,1430,5200,43.8
2024-01-05,980,3400,51.2
2024-01-06,1100,3900,48.5
2024-01-07,1680,6100,40.1
2024-01-08,1550,5600,41.9
2024-01-09,1420,5100,44.2
2024-01-10,1290,4700,46.3
EOF

This creates traffic.csv with ten days of sample data.
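
If your system includes the column utility (present on most Linux and macOS installations), you can pretty-print the file as a quick sanity check that it was written correctly:

column -s',' -t traffic.csv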

Exploring the Dataset


Counting Records

To see how many lines the file contains:

wc -l traffic.csv

This returns 11, meaning there are 10 data rows plus the header.
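
If you only want the number of data rows, strip the header with tail before counting (this simply combines two commands used throughout this guide):

tail -n +2 traffic.csv | wc -l

This prints 10.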


Previewing the File

View the first few lines with:

head -n 5 traffic.csv

You’ll see the header and the four rows beneath it.


Pulling Out a Single Column

To extract only the visitors column:

cut -d',' -f2 traffic.csv | tail -n +2

cut isolates column 2, and tail -n +2 removes the header.
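
The same extraction can be done with awk alone, which becomes convenient once the pipelines grow longer; this is simply an alternative to the cut/tail combination above:

awk -F',' 'NR>1 {print $2}' traffic.csv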


Central Tendency: Mean, Median, Mode

Calculating the Mean (Average)

cut -d',' -f2 traffic.csv | tail -n +2 | \
awk '{sum+=$1; count++} END {print "Mean:", sum/count}'

Output:

Mean: 1340
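
As a cross-check, you can build the sum with paste and bc and do the division yourself. This is just another route to the same mean, assuming paste and bc behave as they do on typical GNU and BSD systems:

SUM=$(cut -d',' -f2 traffic.csv | tail -n +2 | paste -sd+ - | bc)
COUNT=$(tail -n +2 traffic.csv | wc -l)
echo "scale=2; $SUM / $COUNT" | bc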


Calculating the Median

cut -d',' -f2 traffic.csv | tail -n +2 | sort -n | \
awk '{arr[NR]=$1; count=NR} END {if(count%2==1) print "Median:", arr[(count+1)/2]; else print "Median:", (arr[count/2]+arr[count/2+1])/2}'

Output:

Median: 1355


Finding the Mode

cut -d',' -f2 traffic.csv | tail -n +2 | sort -n | uniq -c | \
sort -rn | head -n 1 | awk '{print "Mode:", $2, "(appears", $1, "times)"}'

This reports the most frequent value. In this sample every visitor count happens to appear only once, so the result is not especially meaningful here; the mode is most useful when values actually repeat.
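
Also note that if several values tie for the highest count, the head -n 1 step shows only one of them. A small awk-only sketch (not part of the original pipeline) that lists every tied value:

cut -d',' -f2 traffic.csv | tail -n +2 | \
awk '{freq[$1]++}
END {
  max = 0
  for (v in freq) if (freq[v] > max) max = freq[v]
  for (v in freq) if (freq[v] == max) print "Mode:", v, "(appears", max, "times)"
}'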

Understanding Spread: Variability in the Data


Maximum

awk -F',' 'NR>1 {if($2>max) max=$2} END {print "Maximum:", max}' traffic.csv


Minimum

awk -F',' 'NR==2 {min=$2} NR>2 {if($2<min) min=$2} END {print "Minimum:", min}' traffic.csv


Min and Max in One Pass

awk -F',' 'NR==2 {min=$2; max=$2} NR>2 {if($2<min) min=$2; if($2>max) max=$2} END {print "Min:", min, "Max:", max}' traffic.csv
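
A simpler, if slightly less efficient, alternative is to sort the column and take the first and last lines:

cut -d',' -f2 traffic.csv | tail -n +2 | sort -n | head -n 1   # minimum
cut -d',' -f2 traffic.csv | tail -n +2 | sort -n | tail -n 1   # maximum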


Population Standard Deviation

awk -F',' 'NR>1 {sum+=$2; sumsq+=$2*$2; count++} \
END {mean=sum/count; print "Std Dev:", sqrt((sumsq/count)-(mean*mean))}' traffic.csv

Output:

Std Dev: 207.364


Sample Standard Deviation

awk -F',' 'NR>1 {sum+=$2; sumsq+=$2*$2; count++} \
END {mean=sum/count; print "Sample Std Dev:", sqrt((sumsq-(sum*sum/count))/(count-1))}' traffic.csv


Variance

awk -F',' 'NR>1 {sum+=$2; sumsq+=$2*$2; count++} \
END {mean=sum/count; var=(sumsq/count)-(mean*mean); print "Variance:", var}' traffic.csv
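
Since the variance is just the square of the population standard deviation, you can compute both in one pass as a quick consistency check (a small extension of the script above):

awk -F',' 'NR>1 {sum+=$2; sumsq+=$2*$2; count++} \
END {mean=sum/count; var=(sumsq/count)-(mean*mean); print "Variance:", var; print "Std Dev:", sqrt(var)}' traffic.csv

For this dataset the standard deviation printed here matches the 207.364 reported earlier, and the variance is its square.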


Percentiles and Quartiles

Quartiles (Q1, Q2, Q3)

cut -d',' -f2 traffic.csv | tail -n +2 | sort -n | awk '
{arr[NR]=$1; count=NR}
END {
  q1 = arr[int((count+1)/4)]
  q2 = (count%2==1) ? arr[int((count+1)/2)] : (arr[count/2] + arr[count/2+1]) / 2
  q3 = arr[int(3*(count+1)/4)]
  print "Q1:", q1
  print "Q2:", q2
  print "Q3:", q3
}'
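
The interquartile range (IQR) is simply Q3 minus Q1, so it follows directly from the same logic; here is a minimal sketch that trims the script above:

cut -d',' -f2 traffic.csv | tail -n +2 | sort -n | awk '
{arr[NR]=$1; count=NR}
END {
  q1 = arr[int((count+1)/4)]
  q3 = arr[int(3*(count+1)/4)]
  print "IQR:", q3 - q1
}'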


Calculating Any Percentile

PERCENTILE=90
cut -d',' -f2 traffic.csv | tail -n +2 | sort -n | \
awk -v p=$PERCENTILE '
{arr[NR]=$1; count=NR}
END {
  pos = (count+1) * p/100
  idx = int(pos)
  frac = pos - idx
  if(idx >= count) print p "th percentile:", arr[count]
  else print p "th percentile:", arr[idx] + frac * (arr[idx+1] - arr[idx])
}'
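
Because the target percentile is passed in as a shell variable, you can wrap the pipeline in a loop to report several percentiles at once, reusing the script above unchanged:

for PERCENTILE in 25 50 75 90; do
  cut -d',' -f2 traffic.csv | tail -n +2 | sort -n | \
  awk -v p=$PERCENTILE '
  {arr[NR]=$1; count=NR}
  END {
    pos = (count+1) * p/100
    idx = int(pos)
    frac = pos - idx
    if(idx >= count) print p "th percentile:", arr[count]
    else print p "th percentile:", arr[idx] + frac * (arr[idx+1] - arr[idx])
  }'
done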


Analyzing Multiple Columns at Once

Averaging Visitors, Page Views, and Bounce Rate

awk -F',' '
NR>1 {
  v_sum += $2
  pv_sum += $3
  br_sum += $4
  count++
}
END {
  print "Average visitors:", v_sum/count
  print "Average page views:", pv_sum/count
  print "Average bounce rate:", br_sum/count
}' traffic.csv
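
If you prefer the averages rounded to a fixed number of decimals, awk's printf works just like its C counterpart; this is a small variant of the script above:

awk -F',' '
NR>1 { v_sum += $2; pv_sum += $3; br_sum += $4; count++ }
END {
  printf "Average visitors: %.1f\n", v_sum/count
  printf "Average page views: %.1f\n", pv_sum/count
  printf "Average bounce rate: %.1f\n", br_sum/count
}' traffic.csv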


Correlation Between Visitors and Page Views

awk -F', *' '
NR>1 {
  x[NR-1] = $2
  y[NR-1] = $3

  sum_x += $2
  sum_y += $3
  count++
}
END {
  if (count < 2) exit

  mean_x = sum_x / count
  mean_y = sum_y / count

  for (i = 1; i <= count; i++) {
    dx = x[i] - mean_x
    dy = y[i] - mean_y

    cov   += dx * dy
    var_x += dx * dx
    var_y += dy * dy
  }

  sd_x = sqrt(var_x / count)
  sd_y = sqrt(var_y / count)

  print "Correlation:", (cov / count) / (sd_x * sd_y)
}' traffic.csv

This computes the Pearson correlation coefficient between visitors and page views: values near 1 indicate a strong positive linear relationship, values near -1 a strong negative one, and values near 0 little linear relationship.
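
The same structure works for any pair of numeric columns. For example, swapping page views (column 3) for bounce rate (column 4) shows whether busier days tend to have higher or lower bounce rates; only the column references change:

awk -F',' '
NR>1 { x[NR-1]=$2; y[NR-1]=$4; sum_x+=$2; sum_y+=$4; count++ }
END {
  if (count < 2) exit
  mean_x = sum_x/count; mean_y = sum_y/count
  for (i = 1; i <= count; i++) {
    dx = x[i]-mean_x; dy = y[i]-mean_y
    cov += dx*dy; var_x += dx*dx; var_y += dy*dy
  }
  print "Correlation:", (cov/count) / (sqrt(var_x/count) * sqrt(var_y/count))
}' traffic.csv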


Conclusion

The Unix command line is far more capable than many new data scientists realize. With just a few built-in tools like awk, cut, and sort, you can calculate averages, variance, percentiles, correlations, and more—without relying on external libraries.

These techniques complement your Python or R workflow, offering a lightweight way to inspect, validate, and explore data.

Since every Unix-style environment includes these tools, you can practice and apply these skills anywhere.

Open your terminal and start experimenting—you’ll be surprised how powerful the command line can be.

