Kickstart Your Data Skills: Perform Real Statistics Using Only the Command Line
Introduction
Many beginners believe that data analysis requires installing Python, R, or other complex tools. Surprisingly, your computer already includes a powerful set of utilities capable of handling real statistical work—the Unix command line.
These built-in tools can process large files extremely quickly, automate repetitive tasks, and run on practically any Linux or macOS machine (and on Windows through WSL). No extra software is necessary.
In this walkthrough, you’ll learn how to perform core statistical operations directly in your terminal using standard Unix commands.
A full Bash script version of this guide is available on GitHub—following along while typing the commands will help you understand each step better.
Before you begin, make sure you have:
- A Unix-based environment (Linux, macOS, or Windows with WSL).
- A terminal window open.
- No external libraries—everything here uses default system utilities.
Creating a Sample Dataset
Let’s start by generating a CSV file that represents daily website metrics. Run this block in your terminal:
cat > traffic.csv << EOF
date,visitors,page_views,bounce_rate
2024-01-01,1250,4500,45.2
2024-01-02,1180,4200,47.1
2024-01-03,1520,5800,42.3
2024-01-04,1430,5200,43.8
2024-01-05,980,3400,51.2
2024-01-06,1100,3900,48.5
2024-01-07,1680,6100,40.1
2024-01-08,1550,5600,41.9
2024-01-09,1420,5100,44.2
2024-01-10,1290,4700,46.3
EOF
This creates traffic.csv with ten days of sample data.
Exploring the Dataset
Counting Records
To see how many lines the file contains:
wc -l traffic.csv
This returns 11, meaning there are 10 data rows plus the header.
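To count only the data rows, skip the header first (tail -n +2 starts output at line 2):
tail -n +2 traffic.csv | wc -l
This prints 10.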
Previewing the File
View the first few lines with:
head -n 5 traffic.csv
You’ll see the header and the four rows beneath it.
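For a more readable preview, you can align the fields with the column utility, which ships with most Linux and macOS systems:
column -t -s',' traffic.csv | head -n 5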
Pulling Out a Single Column
To extract only the visitors column:
cut -d',' -f2 traffic.csv | tail -n +2
cut isolates column 2, and tail -n +2 removes the header.
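The same extraction can be done with awk alone, a pattern the later examples rely on (NR>1 skips the header row):
awk -F',' 'NR>1 {print $2}' traffic.csv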
Central Tendency: Mean, Median, Mode
Calculating the Mean (Average)
cut -d',' -f2 traffic.csv | tail -n +2 | \
awk '{sum+=$1; count++} END {print "Mean:", sum/count}'
Output:
Mean: 1340
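The same result can be computed without the cut and tail pipeline by letting awk read the file directly; this is an equivalent one-liner:
awk -F',' 'NR>1 {sum+=$2; count++} END {print "Mean:", sum/count}' traffic.csv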
Calculating the Median
cut -d',' -f2 traffic.csv | tail -n +2 | sort -n | \
awk '{arr[NR]=$1; count=NR} END {if(count%2==1) print "Median:", arr[(count+1)/2]; else print "Median:", (arr[count/2]+arr[count/2+1])/2}'
Output:
Median: 1355
Finding the Mode
cut -d',' -f2 traffic.csv | tail -n +2 | sort -n | uniq -c | \
sort -rn | head -n 1 | awk '{print "Mode:", $2, "(appears", $1, "times)"}'
This reports the most frequent value and how many times it appears. In this sample every visitor count occurs exactly once, so there is no meaningful mode and the command simply returns one of the tied values; with real data containing repeats, it surfaces the value with the highest count.
Understanding Spread: Variability in the Data
Maximum
awk -F',' 'NR>1 {if($2>max) max=$2} END {print "Maximum:", max}' traffic.csv
Minimum
awk -F',' 'NR==2 {min=$2} NR>2 {if($2<min) min=$2} END {print "Minimum:", min}' traffic.csv
Min and Max in One Pass
awk -F',' 'NR==2 {min=$2; max=$2} NR>2 {if($2<min) min=$2; if($2>max) max=$2} END {print "Min:", min, "Max:", max}' traffic.csv
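Output:
Min: 980 Max: 1680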
Population Standard Deviation
awk -F',' 'NR>1 {sum+=$2; sumsq+=$2*$2; count++} \
END {mean=sum/count; print "Std Dev:", sqrt((sumsq/count)-(mean*mean))}' traffic.csv
Output:
Std Dev: 207.364
Sample Standard Deviation
awk -F',' 'NR>1 {sum+=$2; sumsq+=$2*$2; count++} \
END {mean=sum/count; print "Sample Std Dev:", sqrt((sumsq-(sum*sum/count))/(count-1))}' traffic.csv
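Output:
Sample Std Dev: 218.581
As expected, the sample figure is slightly larger than the population figure because it divides by count-1 instead of count.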
Variance
awk -F',' 'NR>1 {sum+=$2; sumsq+=$2*$2; count++} \
END {mean=sum/count; var=(sumsq/count)-(mean*mean); print "Variance:", var}' traffic.csv
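Output:
Variance: 43000
This is simply the square of the population standard deviation reported above.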
Percentiles and Quartiles
Quartiles (Q1, Q2, Q3)
cut -d',' -f2 traffic.csv | tail -n +2 | sort -n | awk '
{arr[NR]=$1; count=NR}
END {
q1 = arr[int((count+1)/4)]
q2 = (count%2==1) ? arr[int((count+1)/2)] : (arr[count/2] + arr[count/2+1]) / 2
q3 = arr[int(3*(count+1)/4)]
print "Q1:", q1
print "Q2:", q2
print "Q3:", q3
}'
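From the quartiles you can also derive the interquartile range (Q3 minus Q1), a robust measure of spread; here is a quick sketch reusing the same indexing:
cut -d',' -f2 traffic.csv | tail -n +2 | sort -n | awk '
{arr[NR]=$1; count=NR}
END {
q1 = arr[int((count+1)/4)]
q3 = arr[int(3*(count+1)/4)]
print "IQR:", q3 - q1
}'
For this dataset it prints IQR: 420.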
Calculating Any Percentile
PERCENTILE=90
cut -d',' -f2 traffic.csv | tail -n +2 | sort -n | \
awk -v p=$PERCENTILE '
{arr[NR]=$1; count=NR}
END {
pos = (count+1) * p/100
idx = int(pos)
frac = pos - idx
if(idx >= count) print p "th percentile:", arr[count]
else print p "th percentile:", arr[idx] + frac * (arr[idx+1] - arr[idx])
}'
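As a sanity check, setting PERCENTILE=50 and rerunning the pipeline returns 1355, matching the median computed earlier, since the 50th percentile and the median are the same statistic.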
Analyzing Multiple Columns at Once
Averaging Visitors, Page Views, and Bounce Rate
awk -F',' '
NR>1 {
v_sum += $2
pv_sum += $3
br_sum += $4
count++
}
END {
print "Average visitors:", v_sum/count
print "Average page views:", pv_sum/count
print "Average bounce rate:", br_sum/count
}' traffic.csv
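Output:
Average visitors: 1340
Average page views: 4850
Average bounce rate: 45.06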
Correlation Between Visitors and Page Views
awk -F', *' '
NR>1 {
x[NR-1] = $2
y[NR-1] = $3
sum_x += $2
sum_y += $3
count++
}
END {
if (count < 2) exit
mean_x = sum_x / count
mean_y = sum_y / count
for (i = 1; i <= count; i++) {
dx = x[i] - mean_x
dy = y[i] - mean_y
cov += dx * dy
var_x += dx * dx
var_y += dy * dy
}
sd_x = sqrt(var_x / count)
sd_y = sqrt(var_y / count)
print "Correlation:", (cov / count) / (sd_x * sd_y)
}' traffic.csv
This computes the Pearson correlation coefficient.
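If you want to compare other column pairs (for example, visitors against bounce rate), the same logic can be wrapped in a small shell function. The sketch below is illustrative: the correlate name and its argument order are not part of the original script, and it assumes a comma-separated file with a single header row.
correlate() {
  # $1 = column number for x, $2 = column number for y, $3 = CSV file
  awk -F',' -v cx="$1" -v cy="$2" '
  NR>1 {
    x[NR-1] = $cx; y[NR-1] = $cy
    sum_x += $cx; sum_y += $cy
    count++
  }
  END {
    if (count < 2) exit
    mean_x = sum_x / count; mean_y = sum_y / count
    for (i = 1; i <= count; i++) {
      dx = x[i] - mean_x; dy = y[i] - mean_y
      cov += dx * dy; var_x += dx * dx; var_y += dy * dy
    }
    print "Correlation:", (cov / count) / (sqrt(var_x / count) * sqrt(var_y / count))
  }' "$3"
}
correlate 2 3 traffic.csv   # visitors vs page views, same result as above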
Conclusion
The Unix command line is far more capable than many new data scientists realize. With just a few built-in tools like awk, cut, and sort, you can calculate averages, variance, percentiles, correlations, and more—without relying on external libraries.
These techniques complement your Python or R workflow, offering a lightweight way to inspect, validate, and explore data.
Since every Unix-style environment includes these tools, you can practice and apply these skills anywhere.
Open your terminal and start experimenting—you’ll be surprised how powerful the command line can be.