Unit 4: Python for AI

Data Analysis with NumPy and Pandas: A Beginner's Introduction

NumPy and Pandas aren't AI libraries - they're data libraries. But every AI project loads, cleans, and prepares data before training a model. That means NumPy and Pandas are part of every AI pipeline whether you think about them or not.

John Bowman

What NumPy Is and Why Arrays Matter

NumPy stands for "Numerical Python." It's a library for working with arrays - ordered collections of numbers organised in a grid.

Why not just use Python lists? Lists work fine for ten numbers. For a million, they get slow. NumPy arrays are fast because they're optimised for numerical operations. Every element is the same type, so the computer can process them efficiently.

import numpy as np

scores = np.array([85, 92, 78, 88, 95])

scores + 10        # adds 10 to every score
scores > 85        # returns [False, True, False, True, True]
scores.mean()      # returns 87.6
scores.std()       # returns standard deviation

With a Python list, you'd loop through each item. With NumPy, you write the operation once and it applies to all items simultaneously - faster and cleaner.
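Here is the same calculation both ways, using the scores above, so you can see the loop disappear:

```python
import numpy as np

scores_list = [85, 92, 78, 88, 95]

# Plain Python: a loop (here a comprehension) visits each element
bumped_list = [s + 10 for s in scores_list]

# NumPy: one vectorised expression covers the whole array
scores = np.array(scores_list)
bumped = scores + 10

print(bumped_list)      # [95, 102, 88, 98, 105]
print(bumped.tolist())  # [95, 102, 88, 98, 105]
```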

In machine learning, your data arrives as arrays. A dataset with 10,000 rows and 50 columns is a 10,000 by 50 array. A trained model is mostly arrays of numbers (weights and biases). NumPy is how you manipulate all of that.
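To make the rows-by-columns idea concrete, here is a toy 4-by-3 array (the numbers are invented for illustration):

```python
import numpy as np

# A toy "dataset": 4 rows (samples) by 3 columns (features)
X = np.array([
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0],
    [7.0, 8.0, 9.0],
    [10.0, 11.0, 12.0],
])

print(X.shape)         # (4, 3)
print(X[0])            # first row: one sample
print(X[:, 1])         # second column: one feature for every sample
print(X.mean(axis=0))  # per-column means: [5.5 6.5 7.5]
```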

What Pandas Is and What a DataFrame Is

Pandas builds on NumPy. While NumPy is great for numbers, it doesn't understand that one column represents ages and another represents postcodes. Pandas does.

A DataFrame is like a spreadsheet. It has rows and columns. Each column has a name. Each row represents one record.

import pandas as pd

df = pd.read_csv('employees.csv')

df['age']           # gives you just the age column
df['salary'] * 1.1  # salary with a 10% increase
df[df['age'] > 30]  # only rows where age > 30
df.groupby('department')['salary'].mean()  # avg salary per dept

This is the actual work of machine learning. You load messy data, explore it, clean it, and prepare it for a model. Most of that work is Pandas.
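If you don't have an employees.csv to hand, you can build a small DataFrame directly from a dictionary and try the operations above on it (names and numbers here are invented):

```python
import pandas as pd

# A tiny stand-in for employees.csv
df = pd.DataFrame({
    'name':   ['Ana', 'Ben', 'Cara'],
    'age':    [34, 28, 41],
    'salary': [52000, 48000, 61000],
})

print(df['age'].tolist())                   # [34, 28, 41]
print(df[df['age'] > 30]['name'].tolist())  # ['Ana', 'Cara']
```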

Loading and Inspecting Data with Pandas

When you get a dataset, the first thing you do is look at it.

df = pd.read_csv('data.csv')
df.head()          # shows first 5 rows
df.shape           # shows (10000, 50) - 10000 rows, 50 columns
df.info()          # shows column names and data types
df.describe()      # shows statistics for each numeric column

This takes two minutes and teaches you a lot. How many rows? What types are the columns? Are there missing values? What's the range of each number?

head() is your first move when exploring. It shows actual data, not just statistics, so you can spot obvious problems: dates formatted oddly, numbers stored as text, columns that don't contain what their names suggest.

Missing values are common. info() shows the non-null count for each column, which reveals where entries are missing. Then you decide: drop those rows, or fill them with something reasonable?

df.isnull().sum()  # how many missing values per column
df.dropna()        # new DataFrame with rows containing any missing value removed
df.fillna(0)       # new DataFrame with missing values replaced by 0

Simple operations, but these are the foundation of real work.
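Filling with 0 is rarely sensible for a column like age; a common alternative is to fill with the column mean or median. A sketch with an invented column:

```python
import numpy as np
import pandas as pd

# Invented column with two gaps
df = pd.DataFrame({'age': [25, np.nan, 35, np.nan, 40]})
print(df['age'].isnull().sum())  # 2

# Replace the gaps with the mean of the observed ages
df['age'] = df['age'].fillna(df['age'].mean())
print(df['age'].tolist())
```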

Basic Operations: Filtering, Grouping, Summarising

Filtering selects a subset of rows based on conditions:

df[df['age'] > 30]
df[df['country'] == 'UK']
df[(df['age'] > 25) & (df['salary'] > 50000)]

Grouping organises rows by category and aggregates them:

df.groupby('department')['salary'].mean()  # avg salary per dept
df.groupby('country').size()               # headcount per country

Summarising condenses data into key statistics:

df['age'].mean()         # average age
df['salary'].max()       # highest salary
df['age'].value_counts() # count at each age

These look simple but they're powerful. What's the average salary per department? Which products have the highest defect rate? How many customers bought something last month? Real machine learning work is mostly these operations: load, filter, group, summarise, clean, prepare, train. That's the loop.
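Strung together on a toy DataFrame (the staff data below is invented for illustration), one pass through the loop looks like this:

```python
import pandas as pd

# Invented staff data to run the loop on
df = pd.DataFrame({
    'department': ['Sales', 'Sales', 'Eng', 'Eng', 'Eng'],
    'salary':     [40000, 44000, 55000, 60000, 65000],
    'age':        [29, 35, 31, 42, 27],
})

over_28 = df[df['age'] > 28]                                  # filter
avg_by_dept = over_28.groupby('department')['salary'].mean()  # group + summarise
print(avg_by_dept)  # Eng 57500.0, Sales 42000.0
```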

Why Both Libraries Appear in Every AI Project

NumPy handles numerical operations at scale. When a model trains, it's doing millions of NumPy operations underneath. When you evaluate performance, you use NumPy to compute accuracy, loss, and other metrics.
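Accuracy, for instance, is one line of NumPy once predictions and labels are arrays (the values below are invented):

```python
import numpy as np

# Invented labels and predictions for five examples
y_true = np.array([1, 0, 1, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0])

# Element-wise comparison gives booleans; mean() counts True as 1
accuracy = (y_true == y_pred).mean()
print(accuracy)  # 0.8 - four of five predictions match
```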

Pandas handles the human side. Real data is messy - CSVs with missing values, wrong column types, duplicates. Pandas lets you explore, clean, and prepare before anything goes near a model.

Whether you're training with scikit-learn, TensorFlow, or PyTorch, you'll use both. They're not optional extras. They're part of the pipeline.

The learning curve isn't steep. You need maybe 20 Pandas functions to be productive: read_csv, head, info, describe, boolean indexing, groupby, agg, dropna, fillna. Learn those and you can handle most data preparation tasks. You'll pick up others as you need them. The harder skill is learning to think systematically about what your data actually contains and what it means.
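Of the functions listed, agg is the only one not shown earlier; it lets one groupby produce several statistics at once (data invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    'department': ['Sales', 'Sales', 'Eng'],
    'salary':     [40000, 44000, 60000],
})

# One groupby, several statistics per group
summary = df.groupby('department')['salary'].agg(['mean', 'min', 'max'])
print(summary)
```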

Lesson Quiz

Two questions to check your understanding before moving on.

Question 1: What is the key advantage of NumPy arrays over standard Python lists for machine learning?

Question 2: What is a Pandas DataFrame?


Frequently Asked Questions

What is NumPy and why is it used in AI?

NumPy (Numerical Python) is a library for working with arrays of numbers. It's used in AI because it performs numerical operations much faster than standard Python lists - it processes entire arrays at once rather than looping through items. Machine learning datasets and model weights are arrays of numbers, so NumPy is the foundation of nearly every AI pipeline.

What is a Pandas DataFrame?

A Pandas DataFrame is a table-like data structure with named columns and rows, similar to a spreadsheet. Each column can hold a different type of data (numbers, text, dates). DataFrames make it easy to load, inspect, filter, group, and clean data before passing it to a machine learning model.

What is the difference between NumPy and Pandas?

NumPy is optimised for pure numerical operations on arrays. Pandas builds on NumPy and adds structure for working with labelled, mixed-type data. NumPy is what happens inside models during training. Pandas is what you use to load, explore, and clean your data before it goes anywhere near a model.

Are NumPy and Pandas hard to learn?

Not particularly. Pandas has many functions but you only need around 20 to be productive: how to load data (read_csv), inspect it (head, info, describe), filter it (boolean indexing), group and summarise it (groupby, agg), and handle missing values (dropna, fillna). NumPy basics are straightforward for array creation and arithmetic. The harder skill is thinking systematically about your data.

How It Works

NumPy achieves its speed through vectorisation. Instead of executing a Python loop (which has interpreter overhead per iteration), NumPy operations are compiled C code that applies to the entire array in a single call. A NumPy operation on a million-element array can be hundreds of times faster than an equivalent Python loop.
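You can measure the gap yourself with timeit; exact speedups vary by machine, so treat this as a sketch rather than a benchmark:

```python
import timeit
import numpy as np

data = list(range(1_000_000))
arr = np.arange(1_000_000)

# Time the same doubling done as a Python loop vs one NumPy call
loop_time = timeit.timeit(lambda: [x * 2 for x in data], number=10)
vec_time = timeit.timeit(lambda: arr * 2, number=10)

print(f"loop: {loop_time:.3f}s  numpy: {vec_time:.3f}s")
```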

Pandas DataFrames are built on top of NumPy arrays. Each column is stored as a typed NumPy array under the hood. The DataFrame layer adds the column names, index, and the higher-level API for grouping, merging, and reshaping. When you call groupby().mean(), Pandas is ultimately delegating to NumPy operations on the underlying arrays.
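You can see that NumPy layer directly: each column exposes its underlying typed array (column contents here are invented):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'age': [25, 30, 35], 'name': ['Ana', 'Ben', 'Cara']})

ages = df['age'].to_numpy()
print(type(ages))        # <class 'numpy.ndarray'>
print(ages.dtype)        # int64 on most platforms

# Column statistics delegate to NumPy on this underlying array
print(df['age'].mean())  # 30.0
```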

Key Points
  • NumPy arrays are optimised for numerical operations - faster than Python lists for large datasets.
  • Operations like array + 10 or array.mean() apply to the entire array at once (vectorisation).
  • Machine learning datasets and model weights are stored as NumPy arrays.
  • Pandas DataFrames are spreadsheet-like tables with named columns, built on top of NumPy.
  • Key Pandas operations: load (read_csv), inspect (head, info, describe), filter, group, summarise, handle missing values.
  • Real ML work is mostly data preparation: load, explore, clean, and prepare before training.
  • Both libraries are part of every ML pipeline regardless of the framework used (scikit-learn, TensorFlow, PyTorch).
  • Learn ~20 Pandas functions and you're productive for most data prep tasks.