Why NumPy? Data Types, Memory & Efficient Computing
Understand why picking the right data type isn't just a technicality — it can be the difference between code that crawls and code that flies.
The Problem: Numbers Are Not Just Numbers
When you first look at a dataset, everything might seem straightforward — just rows of numbers. But here is something many beginners overlook: not all numbers are equal, and treating them like they are can quietly destroy your program's performance.
Consider a simple table with two columns about people:
| Column | Example Value | Typical Range | Storage Needed |
|---|---|---|---|
| Age | 27 | 0 – 120 | Small (7 bits) |
| Net Worth ($) | 60,000,000,000 | 0 – 60 billion+ | Large (32–64 bits) |
Both columns contain integers. Yet they have vastly different ranges and therefore require different amounts of memory to store accurately. If you blindly use the same data type for both, you either waste memory (bad for large datasets) or risk data overflow (even worse!).
This problem gets even more interesting when you bring in currencies. A net worth in dollars may reach the billions, but the same figure in a highly devalued currency might run into the trillions. Your data type choice must account for the real-world magnitude of the data you're working with.
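To make the risk concrete, here is a minimal sketch (the net_worth variable is illustrative) of what happens when a 60-billion value is forced into a 32-bit integer:

import numpy as np

# A signed 32-bit integer tops out at about 2.1 billion,
# far below a 60-billion net worth.
net_worth = np.array([60_000_000_000], dtype=np.int64)
print(net_worth)                    # [60000000000]: int64 holds it comfortably
print(net_worth.astype(np.int32))  # a garbage value: the cast silently overflows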
A Quick Primer: Bits, Bytes & Binary
To truly understand data types, you need to speak a little bit of the computer's language — binary. Don't worry, we'll keep it practical.
What is a Bit?
A bit is the smallest unit of memory. It can only hold one of two values: 0 or 1. Think of it as a light switch — it's either off (0) or on (1).
With n bits, you can represent 2ⁿ unique values, ranging from 0 to 2ⁿ − 1.
How Many Bits Do We Need for Age?
Our maximum age is about 120. Let's figure out how many bits we need:
(Figure: the 7-bit representation of 127, i.e. 1111111 in binary, where all ones give the maximum value.)
With 7 bits, we can store values from 0 up to 127 (which is 2⁷ − 1). Since our maximum age is 120, seven bits is just enough.
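You can verify this in plain Python: the built-in `int.bit_length()` method reports how many bits a value needs.

# How many bits does each value need? (pure Python, no libraries)
print((120).bit_length())             # 7: ages fit in 7 bits
print((255).bit_length())             # 8: the limit of one byte
print((60_000_000_000).bit_length())  # 36: too big for 32-bit signed, reach for 64 bits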
| Bits (n) | Max Value (2ⁿ − 1) | Bytes | Good For |
|---|---|---|---|
| 7 | 127 | < 1 byte | Age, small counts |
| 8 | 255 | 1 byte | Small positive integers |
| 16 | 65,535 | 2 bytes | Scores, measurements |
| 32 | ~4.3 billion | 4 bytes | Net worth (USD millions) |
| 64 | ~18.4 quintillion | 8 bytes | Financial transactions |
Python's Dirty Secret: Memory Overhead
Python is a beautiful, easy-to-read language. But it hides a significant cost from you. Let's look at what happens when you store a simple integer in Python:
# Pure Python — storing a simple age
x = 5
print(type(x)) # <class 'int'>
# How much memory does this actually take?
import sys
print(sys.getsizeof(x)) # Output: 28 bytes!
Wait — 28 bytes to store the number 5? Theoretically we only need about 3 bits! What's going on?
Python Wraps Everything in Objects
Python is an object-oriented language. Even a simple integer like 5 is not stored as a raw number in memory. Instead, Python wraps it in an object that carries:
- A reference count (for garbage collection)
- A type pointer (to confirm it's an int)
- The actual numeric value
- Other housekeeping metadata
This design makes Python very easy to use — you never have to manage memory manually. But it means every number you create consumes dozens of times more memory than it theoretically needs.
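You can see the overhead directly with `sys.getsizeof`. The exact figures vary by Python version and platform, but on 64-bit CPython they look roughly like this:

import sys

print(sys.getsizeof(0))       # ~24-28 bytes: even zero pays the object header
print(sys.getsizeof(5))       # ~28 bytes for one small integer
print(sys.getsizeof(10**30))  # ~40 bytes: big ints grow, but never shrink below the header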
| Language/Tool | Memory for integer 5 | Why |
|---|---|---|
| Pure Python | ~28 bytes | Object overhead, reference counting |
| NumPy int8 | 1 byte (8 bits) | Raw C-style integer, no wrapping |
| NumPy int32 | 4 bytes | Raw C-style integer |
| NumPy int64 | 8 bytes | Raw C-style integer (default) |
The Maths of Scale
Let's make this concrete. Imagine you have a dataset of Nigerian mobile transactions — 500 million rows, with an "amount" column stored as Python integers:
# 500 million records × 28 bytes (Python int) = ?
python_memory = 500_000_000 * 28 # 14,000,000,000 bytes
print(f"Python: {python_memory / 1e9:.1f} GB") # 14.0 GB
# Same data with NumPy int32 (4 bytes)
numpy_memory = 500_000_000 * 4
print(f"NumPy int32: {numpy_memory / 1e9:.1f} GB") # 2.0 GB
# Savings!
print(f"You saved {(python_memory - numpy_memory) / 1e9:.1f} GB") # 12.0 GB
That's 12 gigabytes saved by simply choosing the right data type for one column. Now multiply this across 50 columns in a real dataset.
Enter NumPy: Precision at Scale
NumPy (Numerical Python) solves Python's memory problem by letting you create numbers with exact bit-size specifications — just like low-level languages such as C or Fortran, but with a friendly Python interface.
Creating NumPy Scalars with Specific Types
import numpy as np
# Store age — we only need 8 bits (0 to 255)
age = np.int8(27)
print(age) # 27
print(age.nbytes) # 1 byte — perfect!
# Store net worth — needs 32 or 64 bits
net_worth = np.int64(60_000_000_000)
print(net_worth.nbytes) # 8 bytes
# Other available integer types
np.int8(5) # 1 byte | range: -128 to 127
np.int16(5) # 2 bytes | range: -32768 to 32767
np.int32(5) # 4 bytes | range: -2.1B to 2.1B
np.int64(5) # 8 bytes | range: very large numbers
# Unsigned (no negatives, doubles positive range)
np.uint8(200) # 1 byte | range: 0 to 255
Signed integers (e.g. `int8`) can hold negative and positive numbers: −128 to 127. Unsigned integers (e.g. `uint8`) can only hold 0 and above: 0 to 255. For "age", unsigned makes sense since age is never negative. For "profit/loss", you'd use signed.
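Rather than memorising these ranges, you can ask NumPy for them with `np.iinfo`:

import numpy as np

# Query the exact range of any integer dtype
print(np.iinfo(np.int8))       # min = -128, max = 127
print(np.iinfo(np.uint8))      # min = 0,    max = 255
print(np.iinfo(np.int64).max)  # 9223372036854775807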
It Also Applies to Floats
Everything we've discussed applies equally to floating-point numbers (numbers with decimal points). NumPy gives you float16, float32, and float64. The smaller the float, the less precision — so choose wisely based on how precise your data needs to be.
# Float types in NumPy
price_low_precision = np.float16(3.14) # 2 bytes
price_high_precision = np.float64(3.14159265358979) # 8 bytes
print(price_low_precision) # 3.14 (slightly rounded)
print(price_high_precision) # 3.14159265358979
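As with integers, you can ask NumPy what each float type offers. `np.finfo` reports the size and the approximate decimal precision:

import numpy as np

# Compare the precision of each float dtype
for ftype in (np.float16, np.float32, np.float64):
    info = np.finfo(ftype)
    print(f"{ftype.__name__}: {info.bits} bits, ~{info.precision} decimal digits")
# float16: 16 bits, ~3 decimal digits
# float32: 32 bits, ~6 decimal digits
# float64: 64 bits, ~15 decimal digits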
NumPy Arrays: Contiguous Memory & CPU Power
Beyond individual numbers, NumPy's biggest advantage is its array — a collection of values stored efficiently in memory. This is where NumPy truly shines compared to Python's built-in list.
Python Lists vs NumPy Arrays
| Feature | Python List | NumPy Array |
|---|---|---|
| Memory layout | Scattered (non-contiguous) | Contiguous (side by side) |
| Element type | Mixed (any type) | Uniform (same type) |
| CPU optimisation | ❌ No SIMD/vectorisation | ✅ Uses CPU SIMD instructions |
| Computation speed | Slow for math ops | Very fast |
| Memory per element | High (object overhead) | Minimal (raw bytes) |
What Does "Contiguous in Memory" Mean?
Imagine your computer's RAM as a long street of houses, each house being one memory slot. When Python stores a list of three numbers, those numbers might end up in houses 1, 47, and 203, far apart from one another. The CPU has to travel across the street each time it needs the next number.
With NumPy, all elements of an array are placed side-by-side: houses 1, 2, 3. The CPU can scoop them all up in one efficient pass — and even process several at once using special instructions called SIMD (Single Instruction, Multiple Data).
import numpy as np
# Python list — elements may be scattered in memory
py_list = [3, 2, 4]
# NumPy array — elements are contiguous, typed, and efficient
np_array = np.array([3, 2, 4], dtype=np.int8)
print(np_array.dtype) # int8
print(np_array.nbytes) # 3 bytes total (3 elements × 1 byte each)
print(np_array.itemsize) # 1 byte per element
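You don't have to take the "contiguous" claim on faith; every NumPy array can describe its own memory layout:

import numpy as np

np_array = np.array([3, 2, 4], dtype=np.int8)
print(np_array.flags['C_CONTIGUOUS'])  # True: one solid block of memory
print(np_array.strides)                # (1,): step 1 byte to reach the next element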
Why Does This Matter for Data Analysis?
Modern data analysis and machine learning involve doing millions of arithmetic operations on arrays — adding columns, multiplying matrices, computing averages. NumPy's contiguous memory and CPU-level optimisations make these operations 10x to 100x faster than doing the same with Python lists.
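For example, the everyday operations mentioned above are single NumPy calls that run as tight compiled loops. A small sketch (the amounts are made up):

import numpy as np

amounts = np.array([1200, 500, 9800, 250], dtype=np.int32)
print(amounts * 2)     # vectorised multiply: no Python-level loop
print(amounts + 100)   # add a constant to every element at once
print(amounts.mean())  # 2937.5, computed entirely in compiled code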
Hands-On Activities
These exercises will reinforce your understanding. Open a Jupyter Notebook, Google Colab, or any Python environment and work through each one.
Compare how much memory Python vs NumPy uses for the same numbers.
import sys
import numpy as np
# Step 1: Create a Python list of 1 million ages (random 0-100)
import random
py_ages = [random.randint(0, 100) for _ in range(1_000_000)]
# Step 2: Create the same data as a NumPy array (uint8)
np_ages = np.array(py_ages, dtype=np.uint8)
# Step 3: Compare sizes
py_size = sys.getsizeof(py_ages) + sum(sys.getsizeof(x) for x in py_ages)  # rough upper bound; CPython caches small ints
np_size = np_ages.nbytes
print(f"Python list size: {py_size / 1e6:.2f} MB")
print(f"NumPy array size: {np_size / 1e6:.2f} MB")
print(f"Ratio: Python is {py_size / np_size:.0f}x larger")
👉 Your task: Run this code. Note the ratio. Then try changing `uint8` to `int32` and `int64`. How does the NumPy size change? How does it compare to Python each time?
Imagine you're building a dataset of Konga.com transaction records. For each column below, choose the most appropriate NumPy dtype and justify your choice.
| Column | Description | Your dtype choice | Your reasoning |
|---|---|---|---|
| customer_age | Customer age, 0–100 | ? | Write your answer |
| item_price_naira | Price in Naira, up to ₦5,000,000 | ? | Write your answer |
| quantity | Items ordered, 1–9999 | ? | Write your answer |
| discount_pct | Discount %, e.g. 12.5% | ? | Write your answer |
| rating | Product rating, 1–5 | ? | Write your answer |
👉 After writing your choices on paper, try creating these as NumPy arrays and verifying their `.dtype` and `.nbytes`, as sketched below.
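For instance, if you decided customer_age should be uint8 (one hypothetical choice, not the answer key), the verification would look like this:

import numpy as np

# Hypothetical choice for the customer_age column: uint8
customer_age = np.array([27, 45, 33, 61], dtype=np.uint8)
print(customer_age.dtype)   # uint8
print(customer_age.nbytes)  # 4 (4 elements × 1 byte each)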
See NumPy's speed advantage with your own eyes by timing the same operation on a Python list vs a NumPy array.
import numpy as np
import time
SIZE = 10_000_000 # 10 million elements
# Python list approach
py_list = list(range(SIZE))
start = time.time()
py_result = [x * 2 for x in py_list]
py_time = time.time() - start
# NumPy array approach
np_arr = np.arange(SIZE, dtype=np.int32)
start = time.time()
np_result = np_arr * 2
np_time = time.time() - start
print(f"Python list: {py_time:.3f} seconds")
print(f"NumPy array: {np_time:.4f} seconds")
print(f"NumPy is ~{py_time/np_time:.0f}x faster!")
👉 Challenge: Try increasing `SIZE` to 100 million. What happens to the gap? Also try `int8` vs `int64` — does the dtype affect speed?
What happens when you try to store a number that's too large for the dtype? Let's find out!
import numpy as np

# int8 can only hold -128 to 127
a = np.int8(127)
print(a)  # 127 ✅

# Note: on recent NumPy versions (2.0+), np.int8(128) raises OverflowError
# instead of silently wrapping, so we trigger the overflow through arithmetic:
b = np.int8(127) + np.int8(1)  # may also emit a RuntimeWarning
print(b)  # -128 !! (overflow wraps around)

c = np.int8(100) + np.int8(100)  # 200 doesn't fit in int8...
print(c)  # Try to predict this before running
- Run the code above. What values do you get for `b` and `c`?
- Can you explain why overflow "wraps around"? (Hint: think about binary — what happens when all 8 bits flip from 1 to the next number? The sketch below shows the bit patterns.)
- What does this teach you about choosing dtypes in real financial data?
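If you want to see the wrap-around in binary, `np.binary_repr` prints the underlying bit patterns (negatives use two's complement):

import numpy as np

print(np.binary_repr(127, width=8))   # 01111111: the largest int8 value
print(np.binary_repr(-128, width=8))  # 10000000: one step past it wraps to the minimum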
Lesson Summary
Here's everything this lesson covered, distilled into its core ideas:
- Not all numbers need the same storage: with n bits you can represent 2ⁿ values, so ages fit in 7–8 bits while billion-scale amounts need 32–64 bits.
- Pure Python wraps every integer in an object costing roughly 28 bytes; NumPy stores raw, fixed-width values of 1, 2, 4, or 8 bytes.
- At scale the difference is enormous: 500 million Python ints take about 14 GB, while the same data as int32 takes 2 GB.
- NumPy arrays are contiguous and uniformly typed, letting the CPU use SIMD instructions and making array maths 10x to 100x faster than Python lists.
- A dtype that is too small overflows and wraps around, so always match the dtype to the real-world range of your data.

To keep building on these ideas, explore:
- Binary arithmetic — understand how computers represent negative numbers (two's complement) and floats (IEEE 754)
- NumPy array operations — vectorised maths, broadcasting, and slicing
- Pandas dtypes — how Pandas (built on NumPy) handles dtype selection in DataFrames
- Memory profiling — use `memory_profiler` or `tracemalloc` to profile real code