Data Analysis with NumPy and Pandas: A Beginner's Introduction
NumPy and Pandas aren't AI libraries - they're data libraries. But every AI project loads, cleans, and prepares data before training a model. That means NumPy and Pandas are part of every AI pipeline whether you think about them or not.
What NumPy Is and Why Arrays Matter
NumPy stands for "Numerical Python." It's a library for working with arrays - ordered collections of numbers organised in a grid.
Why not just use Python lists? Lists work fine for ten numbers. For a million, they get slow. NumPy arrays are fast because every element is the same type and stored contiguously in memory, so operations run in optimised compiled code instead of interpreted Python loops.
import numpy as np
scores = np.array([85, 92, 78, 88, 95])
scores + 10 # adds 10 to every score
scores > 85 # returns [False, True, False, True, True]
scores.mean() # returns 87.6
scores.std() # returns standard deviation
With a Python list, you'd loop through each item. With NumPy, you write the operation once and it applies to every element at once - a style called vectorisation. Faster and cleaner.
In machine learning, your data arrives as arrays. A dataset with 10,000 rows and 50 columns is a 10,000 by 50 array. A trained model is mostly arrays of numbers (weights and biases). NumPy is how you manipulate all of that.
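To make the "data is arrays" idea concrete, here is a small sketch. The shape and values are invented for illustration - a real dataset has the same structure, just bigger:

```python
import numpy as np

# A toy "dataset": 4 rows (samples), 3 columns (features).
# A real one might be shape (10000, 50) instead.
data = np.array([
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0],
    [7.0, 8.0, 9.0],
    [10.0, 11.0, 12.0],
])

print(data.shape)         # (4, 3) - rows by columns
print(data.mean(axis=0))  # per-column means: [5.5 6.5 7.5]
print(data.mean(axis=1))  # per-row means: [ 2.  5.  8. 11.]
```

The axis argument is the key idiom: axis=0 aggregates down each column, axis=1 across each row.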
What Pandas Is and What a DataFrame Is
Pandas builds on NumPy. While NumPy is great for numbers, it doesn't understand that one column represents ages and another represents postcodes. Pandas does.
A DataFrame is like a spreadsheet. It has rows and columns. Each column has a name. Each row represents one record.
import pandas as pd
df = pd.read_csv('employees.csv')
df['age'] # gives you just the age column
df['salary'] * 1.1 # salary with a 10% increase
df[df['age'] > 30] # only rows where age > 30
df.groupby('department')['salary'].mean() # avg salary per dept
This is the actual work of machine learning. You load messy data, explore it, clean it, and prepare it for a model. Most of that work is pandas.
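Since employees.csv is just an example file, here is the same pattern on a DataFrame built in memory, with names and figures invented for illustration:

```python
import pandas as pd

# A tiny in-memory stand-in for employees.csv (made-up data).
df = pd.DataFrame({
    'name': ['Amy', 'Ben', 'Cara', 'Dan'],
    'age': [28, 34, 41, 25],
    'department': ['Sales', 'IT', 'Sales', 'IT'],
    'salary': [30000, 45000, 52000, 38000],
})

print(df['age'])           # just the age column
print(df['salary'] * 1.1)  # salaries with a 10% increase
print(df[df['age'] > 30])  # the two rows where age > 30

# Average salary per department: IT 41500.0, Sales 41000.0
print(df.groupby('department')['salary'].mean())
```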
Loading and Inspecting Data with Pandas
When you get a dataset, the first thing you do is look at it.
df = pd.read_csv('data.csv')
df.head() # shows first 5 rows
df.shape # shows (10000, 50) - 10000 rows, 50 columns
df.info() # shows column names and data types
df.describe() # shows statistics for each numeric column
This takes two minutes and teaches you a lot. How many rows? What types are the columns? Are there missing values? What's the range of each number?
head() is your first move when exploring. It shows actual data, not just statistics. You can spot obvious problems: dates formatted oddly, numbers stored as text, stray placeholder values like 'N/A' or -999.
Missing values are common. info() shows which columns have missing entries. Then you decide: drop those rows, or fill them with something reasonable?
df.isnull().sum() # how many missing values per column
df.dropna() # returns a new DataFrame with incomplete rows removed
df.fillna(0) # returns a new DataFrame with missing values set to 0
Simple operations, but these are the foundation of real work.
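Here is the missing-value workflow end to end on a toy DataFrame (values invented for illustration):

```python
import numpy as np
import pandas as pd

# A toy DataFrame with two missing values (NaN).
df = pd.DataFrame({
    'age': [25, np.nan, 31, 40],
    'salary': [30000, 42000, np.nan, 55000],
})

print(df.isnull().sum())  # one missing value in each column
print(df.dropna())        # keeps only the 2 complete rows
print(df.fillna(0))       # NaNs replaced with 0

# A common middle ground: fill each column with its own mean.
print(df.fillna(df.mean()))
```

Note that dropna() and fillna() return new DataFrames rather than changing df in place - assign the result back (df = df.dropna()) to keep it.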
Basic Operations: Filtering, Grouping, Summarising
Filtering selects a subset of rows based on conditions:
df[df['age'] > 30]
df[df['country'] == 'UK']
df[(df['age'] > 25) & (df['salary'] > 50000)]
Grouping organises rows by category and aggregates them:
df.groupby('department')['salary'].mean() # avg salary per dept
df.groupby('country').size() # headcount per country
Summarising condenses data into key statistics:
df['age'].mean() # average age
df['salary'].max() # highest salary
df['age'].value_counts() # count at each age
These look simple but they're powerful. What's the average salary per department? Which products have the highest defect rate? How many customers bought something last month? Real machine learning work is mostly these operations: load, filter, group, summarise, clean, prepare, train. That's the loop.
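The filter / group / summarise loop can be sketched end to end on one small in-memory table (all values invented):

```python
import pandas as pd

# Invented employee data to exercise filter, group, and summarise.
df = pd.DataFrame({
    'country': ['UK', 'UK', 'FR', 'FR', 'UK'],
    'age': [22, 35, 41, 29, 33],
    'salary': [28000, 52000, 61000, 47000, 55000],
})

# Filter: rows matching both conditions (3 of the 5 rows qualify).
experienced_high_pay = df[(df['age'] > 25) & (df['salary'] > 50000)]

# Group: one aggregate per category. UK 45000.0, FR 54000.0
avg_by_country = df.groupby('country')['salary'].mean()

# Summarise: a single statistic for a column.
print(df['age'].mean())  # 32.0
print(avg_by_country)
```

Note the & for combining conditions (not Python's and) and the parentheses around each condition - both are required by pandas boolean indexing.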
Why Both Libraries Appear in Every AI Project
NumPy handles numerical operations at scale. When a scikit-learn model trains, it's running NumPy operations underneath; TensorFlow and PyTorch use their own tensor types, but they follow the same array model and convert to and from NumPy constantly. When you evaluate performance, you'll typically use NumPy to compute accuracy, loss, and other metrics.
Pandas handles the human side. Real data is messy - CSVs with missing values, wrong column types, duplicates. Pandas lets you explore, clean, and prepare before anything goes near a model.
Whether you're training with scikit-learn, TensorFlow, or PyTorch, you'll use both. They're not optional extras. They're part of the pipeline.
The learning curve isn't steep. You need maybe 20 Pandas functions to be productive: read_csv, head, info, describe, boolean indexing, groupby, agg, dropna, fillna. Learn those and you can handle most data preparation tasks. You'll pick up others as you need them. The harder skill is learning to think systematically about what your data actually contains and what it means.
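Of the functions listed, agg is the only one not shown earlier. It lets a single groupby produce several statistics at once - a quick sketch on invented data:

```python
import pandas as pd

# Invented data to demonstrate agg with multiple aggregations.
df = pd.DataFrame({
    'department': ['Sales', 'Sales', 'IT', 'IT'],
    'salary': [30000, 50000, 40000, 60000],
})

# One row per department, one column per statistic.
summary = df.groupby('department')['salary'].agg(['mean', 'min', 'max'])
print(summary)
```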
Lesson Quiz
Two questions to check your understanding before moving on.
Question 1: What is the key advantage of NumPy arrays over standard Python lists for machine learning?
Question 2: What is a Pandas DataFrame?
Podcast Version
Prefer to listen? The full lesson is available as a podcast episode.