Neural Network Fundamentals
Цей контент ще не доступний вашою мовою.
AI/ML Engineering Track | Complexity:
[COMPLEX]| Time: 5-6
Reading Time: 5-6 hours Phase: 6 - Deep Learning Foundations
Section titled “Reading Time: 5-6 hours Phase: 6 - Deep Learning Foundations”Princeton, New Jersey. February 2005. 11:47 PM. Travis Oliphant had a problem. As an astronomer, he needed to process massive arrays of telescope data—millions of numbers representing distant galaxies. Python was perfect for writing analysis scripts, but the existing numerical libraries were a mess. There were two competing packages, Numeric and numarray, and neither could handle his data efficiently.
So Oliphant did what any frustrated scientist would do: he merged them. Working nights and weekends, he rewrote core components in C, unified the competing APIs, and released something called “NumPy 1.0” in October 2006.
He had no idea he was building the foundation for the AI revolution.
“I just needed to process telescope data. I never imagined that the same operations—matrix multiplication, broadcasting, vectorization—would become the core primitives of deep learning.” — Travis Oliphant, NumPy creator and founder of Anaconda
Today, every neural network training run—from gpt-5 to Stable Diffusion—ultimately relies on the array operations Oliphant designed for looking at stars.
What You’ll Be Able to Do
Section titled “What You’ll Be Able to Do”By the end of this module, you will:
- Master NumPy for high-performance numerical computing
- Manipulate data fluently with pandas DataFrames
- Create publication-quality visualizations with matplotlib and seaborn
- Understand WHY these tools are essential for machine learning
- Build a reusable ML data toolkit
Introduction: The Scientific Python Ecosystem
Section titled “Introduction: The Scientific Python Ecosystem”You’ve spent 24 modules building AI applications using APIs, frameworks, and high-level tools. Now we’re going deeper. Phase 6 is about understanding how neural networks actually work—not just using them, but building them from scratch.
But before we can build neural networks, we need the right tools. Imagine trying to build a house with your bare hands versus having power tools. NumPy, pandas, and matplotlib are your power tools for machine learning.
The Story Behind Python’s ML Dominance
Section titled “The Story Behind Python’s ML Dominance”Python wasn’t designed for machine learning. Guido van Rossum created it in 1989 as a teaching language—easy to read, easy to write, suitable for beginners. For years, serious number-crunching happened in Fortran, MATLAB, or C++.
So how did a “slow” teaching language become the foundation of AI?
The Killer Feature: Wrapper Libraries
In the late 1990s, developers discovered Python’s secret weapon: it could easily wrap high-performance C and Fortran code. You could write your logic in readable Python while the heavy computation happened in optimized native code.
This led to a snowball effect:
- 1995: Numeric (NumPy’s ancestor) made array operations easy
- 2001: SciPy added scientific computing tools
- 2007: scikit-learn made ML accessible to non-experts
- 2012: Theano proved deep learning could work in Python
- 2015: TensorFlow and Keras brought deep learning to the masses
- 2017: PyTorch made deep learning research-friendly
Each library attracted more users. More users meant more contributors. More contributors built more libraries. By 2020, Python’s ML ecosystem was so extensive that switching to another language meant abandoning thousands of battle-tested tools.
The Network Effect
Today, Python’s dominance is self-reinforcing:
- Job postings require Python → Students learn Python
- Research papers use Python → Practitioners use Python
- Libraries are written in Python → New tools support Python first
The few alternatives that exist (Julia, R, MATLAB) are fighting an uphill battle against this network effect. Even Julia, which is genuinely faster than Python for many tasks, struggles to build ecosystem momentum.
Why Python Dominates ML/AI
Section titled “Why Python Dominates ML/AI”In 2024, Python handles:
- 92% of machine learning projects
- 85% of data science workflows
- 100% of the top deep learning frameworks (PyTorch, TensorFlow, JAX)
But Python is slow! A naive Python loop is 10-100x slower than C. So how does it dominate computationally intensive ML?
The answer: Python is the glue, not the engine.
┌─────────────────────────────────────────────────┐│ Python Code ││ (Easy to write, flexible, readable) │└──────────────────────┬──────────────────────────┘ │ calls ▼┌─────────────────────────────────────────────────┐│ NumPy / BLAS / LAPACK / cuDNN / MKL ││ (Optimized C/Fortran/CUDA, blazing fast) │└─────────────────────────────────────────────────┘You write simple Python. Behind the scenes, highly optimized native code does the heavy lifting. This is the genius of the Scientific Python ecosystem.
Did You Know? The Origins of Scientific Python
Section titled “Did You Know? The Origins of Scientific Python”The Birth of NumPy: A Tale of Two Libraries
Section titled “The Birth of NumPy: A Tale of Two Libraries”In the early 2000s, Python had a problem: TWO competing array libraries.
Numeric (1995): Created by Jim Hugunin at MIT. Fast but limited.
Numarray (2001): Created by Space Telescope Science Institute. More features but slower.
The community was split. Code written for one library wouldn’t work with the other. It was chaos.
Enter Travis Oliphant, a grad student at the Mayo Clinic who needed both libraries’ features. In 2005, he did something audacious: he merged them into NumPy.
“I had about 3 months of time between finishing my PhD and starting my new job. I thought, ‘How hard can it be?’” — Travis Oliphant
It took him those 3 months, working 80-hour weeks, rewriting both libraries into one cohesive package. NumPy 1.0 was released in 2006.
The impact: NumPy became the foundation for all of scientific Python. pandas, scikit-learn, TensorFlow, PyTorch—all built on NumPy arrays.
The pandas Story: Wall Street Meets Open Source
Section titled “The pandas Story: Wall Street Meets Open Source”In 2008, Wes McKinney was working at AQR Capital Management, a hedge fund. He was frustrated with the clunky tools for analyzing financial data.
“I remember thinking, ‘I cannot believe I have to suffer through this horrible, horrible R interface.’” — Wes McKinney
He started building a library for himself. It was so useful that AQR let him open-source it in 2009. He named it pandas (Panel Data System).
By 2012, pandas had revolutionized data analysis in Python. Wes left finance to work on pandas full-time, funded by various companies who depended on it.
Fun fact: AQR initially resisted open-sourcing pandas, fearing it would help competitors. Wes convinced them that the community contributions would far outweigh any competitive advantage they might lose. He was right—pandas now has over 2,000 contributors.
matplotlib: The Scientist Who Needed Better Graphs
Section titled “matplotlib: The Scientist Who Needed Better Graphs”In 2002, John Hunter was a neurobiologist doing EEG analysis. He needed to visualize brain signals but found existing tools inadequate.
“I was frustrated with the limited plotting capabilities of the tools available at the time and decided to write my own.” — John Hunter
He created matplotlib to mimic MATLAB’s plotting (hence the name). It became the standard plotting library in Python.
Tragically, John Hunter passed away in 2012 from cancer. The matplotlib project continues in his memory, maintained by a global community.
His legacy: Every plot in nearly every Jupyter notebook, every figure in thousands of scientific papers, traces back to his work.
NumPy: The Foundation of Numerical Python
Section titled “NumPy: The Foundation of Numerical Python”What is NumPy?
Section titled “What is NumPy?”NumPy (Numerical Python) provides:
- ndarray: A powerful N-dimensional array object
- Broadcasting: Smart element-wise operations
- Linear algebra: Matrix operations, decompositions
- Random numbers: Statistical distributions
- C/Fortran integration: For custom high-performance code
Why Arrays, Not Lists?
Section titled “Why Arrays, Not Lists?”Think of a Python list like a filing cabinet where each drawer can hold anything—a number, a string, a photo, another cabinet. Flexible, but every time you need something, you have to open the drawer, check what’s inside, and figure out how to use it. A NumPy array is like a warehouse with identical boxes stacked in perfect rows—you know exactly what’s in each box and exactly where to find it. That uniformity is what makes NumPy 100-1000x faster.
Python lists are flexible but slow:
# Python list: Each element is a full Python objectpython_list = [1, 2, 3, 4, 5]# Stored as: [ptr] → [PyObject: type, refcount, value]# [ptr] → [PyObject: type, refcount, value]# ...
# NumPy array: Contiguous block of raw memorynumpy_array = np.array([1, 2, 3, 4, 5])# Stored as: [1][2][3][4][5] (just the numbers, packed tight)Memory comparison:
- Python list of 1 million integers: ~28 MB
- NumPy array of 1 million integers: ~4 MB (7x smaller!)
Speed comparison:
# Adding two lists element-wisepython_result = [a + b for a, b in zip(list1, list2)] # ~500ms
# Adding two NumPy arraysnumpy_result = arr1 + arr2 # ~2ms (250x faster!)Core NumPy Concepts
Section titled “Core NumPy Concepts”1. Creating Arrays
Section titled “1. Creating Arrays”import numpy as np
# From Python listsarr = np.array([1, 2, 3, 4, 5])matrix = np.array([[1, 2, 3], [4, 5, 6]])
# Common initializationszeros = np.zeros((3, 4)) # 3x4 array of zerosones = np.ones((2, 3)) # 2x3 array of onesempty = np.empty((2, 2)) # Uninitialized (faster)identity = np.eye(4) # 4x4 identity matrixrange_arr = np.arange(0, 10, 2) # [0, 2, 4, 6, 8]linspace = np.linspace(0, 1, 5) # [0, 0.25, 0.5, 0.75, 1]
# Random arraysrandom_uniform = np.random.rand(3, 3) # Uniform [0, 1)random_normal = np.random.randn(3, 3) # Standard normalrandom_int = np.random.randint(0, 10, (3, 3)) # Random integers2. Array Attributes
Section titled “2. Array Attributes”arr = np.array([[1, 2, 3], [4, 5, 6]])
arr.shape # (2, 3) - 2 rows, 3 columnsarr.ndim # 2 - number of dimensionsarr.size # 6 - total number of elementsarr.dtype # dtype('int64') - data typearr.itemsize # 8 - bytes per elementarr.nbytes # 48 - total bytes (6 * 8)3. Reshaping and Manipulating
Section titled “3. Reshaping and Manipulating”arr = np.arange(12) # [0, 1, 2, ..., 11]
# Reshapereshaped = arr.reshape(3, 4) # 3x4 matrixreshaped = arr.reshape(2, -1) # 2 rows, auto-calculate columns
# Flattenflat = reshaped.flatten() # Returns copyraveled = reshaped.ravel() # Returns view (faster)
# Transposetransposed = reshaped.T # Swap rows and columns
# Stackinga = np.array([1, 2, 3])b = np.array([4, 5, 6])np.vstack([a, b]) # [[1,2,3], [4,5,6]]np.hstack([a, b]) # [1, 2, 3, 4, 5, 6]np.column_stack([a, b]) # [[1,4], [2,5], [3,6]]4. Indexing and Slicing
Section titled “4. Indexing and Slicing”arr = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])
# Basic indexingarr[0, 0] # 1 (first element)arr[1, 2] # 7 (row 1, col 2)arr[-1, -1] # 12 (last element)
# Slicing (start:stop:step)arr[0, :] # [1, 2, 3, 4] (first row)arr[:, 0] # [1, 5, 9] (first column)arr[0:2, 1:3] # [[2,3], [6,7]] (submatrix)arr[::2, :] # Every other row
# Boolean indexing (POWERFUL!)arr[arr > 5] # [6, 7, 8, 9, 10, 11, 12]arr[arr % 2 == 0] # Even numbers
# Fancy indexingarr[[0, 2], :] # Rows 0 and 2arr[:, [0, 3]] # Columns 0 and 35. Broadcasting
Section titled “5. Broadcasting”Broadcasting is NumPy’s superpower—it lets arrays of different shapes work together:
# Scalar broadcastarr = np.array([1, 2, 3])arr + 10 # [11, 12, 13] - 10 "broadcasts" to match arr
# 1D to 2D broadcastmatrix = np.array([[1, 2, 3], [4, 5, 6]])row = np.array([10, 20, 30])matrix + row # [[11,22,33], [14,25,36]]
# Column broadcastcol = np.array([[100], [200]])matrix + col # [[101,102,103], [204,205,206]]Broadcasting rules:
- Arrays with fewer dimensions are padded with 1s on the left
- Arrays with size 1 along a dimension act as if copied along that dimension
- Arrays must have compatible shapes after these rules
Shape (3, 4) + Shape (4,) → Works! (4,) becomes (1, 4), broadcasts to (3, 4)Shape (3, 4) + Shape (3,) → Error! Can't broadcast (3,) to (3, 4)Shape (3, 4) + Shape (3, 1) → Works! (3, 1) broadcasts to (3, 4)6. Vectorized Operations
Section titled “6. Vectorized Operations”NumPy’s real power: operations on entire arrays at once.
# Universal functions (ufuncs) - operate element-wisenp.sqrt(arr) # Square root of each elementnp.exp(arr) # e^x for each elementnp.log(arr) # Natural lognp.sin(arr) # Sinenp.abs(arr) # Absolute value
# Aggregationsarr.sum() # Sum all elementsarr.mean() # Averagearr.std() # Standard deviationarr.min(), arr.max() # Min and maxarr.argmin(), arr.argmax() # Index of min/max
# Along axesmatrix.sum(axis=0) # Sum each columnmatrix.sum(axis=1) # Sum each rowmatrix.mean(axis=0) # Mean of each column7. Linear Algebra
Section titled “7. Linear Algebra”This is where ML really lives:
# Matrix multiplicationA = np.array([[1, 2], [3, 4]])B = np.array([[5, 6], [7, 8]])
A @ B # Matrix multiplication (Python 3.5+)np.dot(A, B) # Same thingnp.matmul(A, B) # Same thing
# Other operationsnp.linalg.inv(A) # Matrix inversenp.linalg.det(A) # Determinantnp.linalg.eig(A) # Eigenvalues and eigenvectorsnp.linalg.svd(A) # Singular Value Decompositionnp.linalg.norm(A) # Matrix/vector norm
# Solving linear systems: Ax = bb = np.array([1, 2])x = np.linalg.solve(A, b)
# QR decomposition (used in many ML algorithms)Q, R = np.linalg.qr(A)Did You Know? NumPy’s Secret Weapons
Section titled “Did You Know? NumPy’s Secret Weapons”BLAS and LAPACK: NumPy’s linear algebra is backed by BLAS (Basic Linear Algebra Subprograms) and LAPACK, libraries originally written in Fortran in the 1970s. These are so optimized that modern Python code using NumPy can be as fast as C code.
Intel MKL: If you install NumPy through Anaconda, you get Intel’s Math Kernel Library (MKL), which uses specialized CPU instructions (AVX, AVX-512) for even faster matrix operations. A matrix multiplication can be 10x faster with MKL!
Memory views: When you slice a NumPy array, you don’t copy data—you create a “view” that shares memory with the original. This is why NumPy is so memory-efficient, but also why modifying a slice modifies the original!
arr = np.array([1, 2, 3, 4, 5])view = arr[1:4]view[0] = 100print(arr) # [1, 100, 3, 4, 5] - Original changed!
# To avoid this, explicitly copy:copy = arr[1:4].copy()The Psychology of Numerical Computing
Section titled “The Psychology of Numerical Computing”Understanding why NumPy is fast helps you write better code. Let’s go deeper than “Python is slow, C is fast.”
Memory Layout Matters More Than You Think
Section titled “Memory Layout Matters More Than You Think”When you create a NumPy array, all elements are stored in a single, contiguous block of memory. Think of it like a filing cabinet where every folder is exactly the same size and in perfect order. A Python list, by contrast, is like a filing cabinet where each drawer contains a note saying “the actual file is in Building B, Room 42, Drawer 7.”
import numpy as npimport sys
# Python list: Each element is a pointer to a separate objectpy_list = [1, 2, 3, 4, 5]# Memory: [ptr1, ptr2, ptr3, ptr4, ptr5]# ↓ ↓ ↓ ↓ ↓# PyInt PyInt PyInt PyInt PyInt
# NumPy array: All values stored contiguouslynp_arr = np.array([1, 2, 3, 4, 5], dtype=np.int64)# Memory: [1|2|3|4|5] - just the raw bytes, packed tight
# Size comparisonprint(f"List size: {sys.getsizeof(py_list)} bytes") # ~120 bytesprint(f"Array size: {np_arr.nbytes} bytes") # ~40 bytesThis contiguous layout has three profound implications:
-
Cache efficiency: Modern CPUs load memory in 64-byte cache lines. With contiguous data, one memory fetch gives you ~8 numbers. With scattered pointers, each number requires a separate fetch.
-
SIMD operations: CPUs can perform the same operation on multiple values simultaneously (Single Instruction, Multiple Data). This only works with contiguous, uniformly-typed data.
-
No type checking: Python must check each element’s type before every operation. NumPy knows all elements are the same type, so it skips this overhead.
Row-Major vs Column-Major: The Hidden Performance Trap
Section titled “Row-Major vs Column-Major: The Hidden Performance Trap”NumPy arrays are stored in “row-major” order (C-style) by default. This means rows are contiguous in memory:
arr = np.array([[1, 2, 3], [4, 5, 6]])# Memory: [1, 2, 3, 4, 5, 6]# ← row 0 → ← row 1 →This has huge performance implications:
import numpy as npimport time
# Create a large matrixmatrix = np.random.rand(10000, 10000)
# Row iteration (fast - follows memory layout)start = time.time()row_sums = [matrix[i, :].sum() for i in range(10000)]print(f"Row iteration: {time.time() - start:.3f}s") # ~0.2s
# Column iteration (slow - jumps across memory)start = time.time()col_sums = [matrix[:, i].sum() for i in range(10000)]print(f"Column iteration: {time.time() - start:.3f}s") # ~1.5s (7x slower!)
# The right way: use NumPy's axis parameterstart = time.time()row_sums_fast = matrix.sum(axis=1) # Sum along columnscol_sums_fast = matrix.sum(axis=0) # Sum along rowsprint(f"Vectorized: {time.time() - start:.3f}s") # ~0.05sThis is why machine learning frameworks often transpose matrices or specify column-major (“Fortran-style”) order for certain operations.
Production War Story: The $3 Million Cache Miss
Section titled “Production War Story: The $3 Million Cache Miss”A hedge fund’s trading algorithm was mysteriously slow. Their quantitative team had spent months optimizing the math—better signal processing, smarter predictions. But the system was still 10x slower than competitors.
The problem? Their feature matrix was stored column-major (Fortran-style), but their code iterated row-by-row. Every single data access caused a cache miss. The CPU spent more time fetching data than computing.
The fix was two characters: changing np.array(data, order='F') to np.array(data, order='C'). Trading latency dropped from 50ms to 5ms. Over a quarter, this translated to approximately $3 million in additional profits from capturing price movements faster.
Lesson learned: Profile your code. The bottleneck is almost never where you think it is.
pandas: Data Manipulation Made Easy
Section titled “pandas: Data Manipulation Made Easy”What is pandas?
Section titled “What is pandas?”pandas provides:
- DataFrame: 2D labeled data structure (like a spreadsheet)
- Series: 1D labeled array (like a column)
- Data I/O: Read/write CSV, Excel, SQL, JSON, Parquet
- Data cleaning: Handle missing data, duplicates, transformations
- Grouping and aggregation: SQL-like operations
- Time series: Date/time handling, resampling
Why This Module Matters
Section titled “Why This Module Matters”Before you train a model, you spend 80% of your time:
- Loading and exploring data
- Cleaning and preprocessing
- Feature engineering
- Splitting and validating
pandas makes all of this 10x easier.
Core pandas Concepts
Section titled “Core pandas Concepts”1. Series and DataFrames
Section titled “1. Series and DataFrames”import pandas as pd
# Series: 1D labeled arrays = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])s['a'] # 1s[['a', 'c']] # Series with a and c
# DataFrame: 2D labeled data structuredf = pd.DataFrame({ 'name': ['Alice', 'Bob', 'Charlie'], 'age': [25, 30, 35], 'salary': [50000, 60000, 70000]})
# name age salary# 0 Alice 25 50000# 1 Bob 30 60000# 2 Charlie 35 700002. Reading and Writing Data
Section titled “2. Reading and Writing Data”# CSVdf = pd.read_csv('data.csv')df.to_csv('output.csv', index=False)
# Exceldf = pd.read_excel('data.xlsx', sheet_name='Sheet1')df.to_excel('output.xlsx', index=False)
# JSONdf = pd.read_json('data.json')df.to_json('output.json', orient='records')
# SQLimport sqlite3conn = sqlite3.connect('database.db')df = pd.read_sql('SELECT * FROM users', conn)df.to_sql('users', conn, if_exists='replace', index=False)
# Parquet (efficient columnar format)df = pd.read_parquet('data.parquet')df.to_parquet('output.parquet')3. Exploring Data
Section titled “3. Exploring Data”# Quick overviewdf.head() # First 5 rowsdf.tail() # Last 5 rowsdf.shape # (rows, columns)df.info() # Column types, non-null countsdf.describe() # Statistical summary
# Column informationdf.columns # Column namesdf.dtypes # Data typesdf['age'].unique() # Unique valuesdf['age'].nunique() # Number of unique valuesdf['age'].value_counts() # Frequency of each value
# Memory usagedf.memory_usage(deep=True)4. Selecting Data
Section titled “4. Selecting Data”# Column selectiondf['name'] # Single column (Series)df[['name', 'age']] # Multiple columns (DataFrame)
# Row selection with .loc (label-based)df.loc[0] # Row with index 0df.loc[0:2] # Rows 0 through 2 (inclusive!)df.loc[0, 'name'] # Specific celldf.loc[:, 'name':'salary'] # All rows, columns name to salary
# Row selection with .iloc (integer-based)df.iloc[0] # First rowdf.iloc[0:2] # First two rows (exclusive!)df.iloc[0, 0] # First celldf.iloc[:, 0:2] # All rows, first two columns
# Boolean selectiondf[df['age'] > 25] # Rows where age > 25df[(df['age'] > 25) & (df['salary'] > 55000)] # Multiple conditionsdf.query('age > 25 and salary > 55000') # Same thing, cleaner5. Data Cleaning
Section titled “5. Data Cleaning”# Missing datadf.isna() # Boolean mask of missing valuesdf.isna().sum() # Count missing per columndf.dropna() # Drop rows with any missingdf.dropna(subset=['age']) # Drop rows missing agedf.fillna(0) # Fill missing with 0df.fillna(df.mean()) # Fill with column meansdf.fillna(method='ffill') # Forward fill
# Duplicatesdf.duplicated() # Boolean maskdf.drop_duplicates() # Remove duplicatesdf.drop_duplicates(subset=['name']) # By specific columns
# Data typesdf['age'] = df['age'].astype(int)df['date'] = pd.to_datetime(df['date'])df['category'] = df['category'].astype('category')
# String operationsdf['name'].str.lower()df['name'].str.strip()df['name'].str.contains('A')df['name'].str.replace('Alice', 'Alicia')6. Transforming Data
Section titled “6. Transforming Data”# Apply functionsdf['age_squared'] = df['age'] ** 2df['age_category'] = df['age'].apply(lambda x: 'young' if x < 30 else 'old')df['full_info'] = df.apply(lambda row: f"{row['name']}: {row['age']}", axis=1)
# Map valuesdf['grade'] = df['score'].map({90: 'A', 80: 'B', 70: 'C'})
# Binningdf['age_group'] = pd.cut(df['age'], bins=[0, 25, 35, 100], labels=['young', 'mid', 'senior'])
# One-hot encodingpd.get_dummies(df, columns=['category'])
# Renamingdf.rename(columns={'name': 'full_name'})df.columns = ['Name', 'Age', 'Salary'] # Rename all7. Grouping and Aggregation
Section titled “7. Grouping and Aggregation”# Group by single columndf.groupby('department')['salary'].mean()df.groupby('department')['salary'].agg(['mean', 'min', 'max'])
# Group by multiple columnsdf.groupby(['department', 'level'])['salary'].mean()
# Multiple aggregationsdf.groupby('department').agg({ 'salary': ['mean', 'median'], 'age': 'mean', 'name': 'count'})
# Transform (return same shape)df['salary_zscore'] = df.groupby('department')['salary'].transform( lambda x: (x - x.mean()) / x.std())
# Filter groupsdf.groupby('department').filter(lambda x: len(x) > 5)8. Merging and Joining
Section titled “8. Merging and Joining”# Merge (SQL-like joins)merged = pd.merge(df1, df2, on='id') # Inner joinmerged = pd.merge(df1, df2, on='id', how='left') # Left joinmerged = pd.merge(df1, df2, on='id', how='outer') # Outer joinmerged = pd.merge(df1, df2, left_on='user_id', right_on='id') # Different names
# Concat (stack DataFrames)combined = pd.concat([df1, df2]) # Verticallycombined = pd.concat([df1, df2], axis=1) # Horizontally
# Join (on index)df1.join(df2, how='left')9. Pivot Tables and Reshaping
Section titled “9. Pivot Tables and Reshaping”# Pivot tablepivot = df.pivot_table( values='sales', index='region', columns='product', aggfunc='sum')
# Melt (unpivot)melted = pd.melt(df, id_vars=['date'], value_vars=['product_a', 'product_b'])
# Stack/unstackstacked = df.stack()unstacked = df.unstack()Did You Know? pandas Performance Secrets
Section titled “Did You Know? pandas Performance Secrets”The chained indexing trap:
# BAD - Creates copy, may not modify originaldf[df['age'] > 25]['salary'] = 100000 # SettingWithCopyWarning!
# GOOD - Use .loc for assignmentdf.loc[df['age'] > 25, 'salary'] = 100000Categorical data: If you have a column with repeated string values, convert to category:
df['status'] = df['status'].astype('category')# Memory: 1M strings "active"/"inactive" = 50MB → 2MB (25x smaller!)Arrow and pandas 2.0: pandas 2.0 (2023) can use Apache Arrow as the backend instead of NumPy. Arrow is faster for string operations and uses less memory:
df = pd.read_csv('data.csv', dtype_backend='pyarrow')Visualization: matplotlib and seaborn
Section titled “Visualization: matplotlib and seaborn”matplotlib: The Grandfather of Python Plotting
Section titled “matplotlib: The Grandfather of Python Plotting”matplotlib gives you complete control over every aspect of a plot. It’s verbose but powerful.
Basic Plotting
Section titled “Basic Plotting”import matplotlib.pyplot as pltimport numpy as np
# Line plotx = np.linspace(0, 10, 100)y = np.sin(x)
plt.figure(figsize=(10, 6))plt.plot(x, y, label='sin(x)', color='blue', linewidth=2)plt.xlabel('X axis')plt.ylabel('Y axis')plt.title('Sine Wave')plt.legend()plt.grid(True, alpha=0.3)plt.savefig('sine.png', dpi=150)plt.show()The Object-Oriented Interface
Section titled “The Object-Oriented Interface”For complex plots, use the OO interface:
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
# Top-left: Line plotaxes[0, 0].plot(x, np.sin(x), 'b-', label='sin')axes[0, 0].plot(x, np.cos(x), 'r--', label='cos')axes[0, 0].legend()axes[0, 0].set_title('Trigonometric Functions')
# Top-right: Scatter plotaxes[0, 1].scatter(np.random.rand(50), np.random.rand(50), c=np.random.rand(50), s=np.random.rand(50)*500, alpha=0.6, cmap='viridis')axes[0, 1].set_title('Scatter Plot')
# Bottom-left: Bar plotcategories = ['A', 'B', 'C', 'D']values = [23, 45, 56, 78]axes[1, 0].bar(categories, values, color='steelblue')axes[1, 0].set_title('Bar Chart')
# Bottom-right: Histogramdata = np.random.randn(1000)axes[1, 1].hist(data, bins=30, edgecolor='black', alpha=0.7)axes[1, 1].set_title('Histogram')
plt.tight_layout()plt.savefig('subplots.png', dpi=150)plt.show()Common Plot Types
Section titled “Common Plot Types”# Scatter plotplt.scatter(x, y, c=colors, s=sizes, alpha=0.6, cmap='viridis')
# Bar plotplt.bar(categories, values)plt.barh(categories, values) # Horizontal
# Histogramplt.hist(data, bins=30, density=True) # density=True normalizes
# Box plotplt.boxplot([data1, data2, data3], labels=['A', 'B', 'C'])
# Pie chart (use sparingly!)plt.pie(sizes, labels=labels, autopct='%1.1f%%', startangle=90)
# Heatmapplt.imshow(matrix, cmap='hot', aspect='auto')plt.colorbar()
# Contour plotplt.contour(X, Y, Z, levels=20)plt.contourf(X, Y, Z, levels=20, cmap='viridis') # Filledseaborn: Statistical Visualization
Section titled “seaborn: Statistical Visualization”seaborn builds on matplotlib with:
- Beautiful default styles
- Statistical visualization functions
- Integration with pandas DataFrames
- Color palettes designed for data
import seaborn as sns
# Set stylesns.set_theme(style='whitegrid')
# Distribution plotssns.histplot(df['age'], kde=True) # Histogram with KDEsns.kdeplot(df['age']) # Just KDEsns.boxplot(x='department', y='salary', data=df)sns.violinplot(x='department', y='salary', data=df)
# Relationship plotssns.scatterplot(x='age', y='salary', hue='department', data=df)sns.lineplot(x='date', y='value', hue='category', data=df)sns.regplot(x='age', y='salary', data=df) # With regression line
# Categorical plotssns.countplot(x='department', data=df)sns.barplot(x='department', y='salary', data=df) # With error barssns.catplot(x='department', y='salary', kind='violin', data=df)
# Matrix plotssns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')sns.clustermap(data_matrix) # Hierarchical clustering
# Pair plots (for exploring relationships)sns.pairplot(df[['age', 'salary', 'experience']], hue='department')
# Joint plots (2D + marginal distributions)sns.jointplot(x='age', y='salary', data=df, kind='hex')Visualization Best Practices for ML
Section titled “Visualization Best Practices for ML”1. Exploring Your Data
Section titled “1. Exploring Your Data”# Quick overview of distributionsfig, axes = plt.subplots(2, 3, figsize=(15, 10))for idx, col in enumerate(df.select_dtypes(include=[np.number]).columns): ax = axes[idx // 3, idx % 3] df[col].hist(ax=ax, bins=30) ax.set_title(col)plt.tight_layout()2. Correlation Analysis
Section titled “2. Correlation Analysis”# Correlation heatmapcorr = df.corr()plt.figure(figsize=(12, 10))sns.heatmap(corr, annot=True, cmap='coolwarm', center=0, square=True, linewidths=0.5)plt.title('Feature Correlations')3. Target Distribution
Section titled “3. Target Distribution”# For classificationfig, axes = plt.subplots(1, 2, figsize=(12, 5))df['target'].value_counts().plot(kind='bar', ax=axes[0])axes[0].set_title('Target Distribution')df['target'].value_counts().plot(kind='pie', autopct='%1.1f%%', ax=axes[1])4. Feature-Target Relationships
Section titled “4. Feature-Target Relationships”# Numerical feature vs targetfig, axes = plt.subplots(1, len(numerical_cols), figsize=(15, 5))for idx, col in enumerate(numerical_cols): sns.boxplot(x='target', y=col, data=df, ax=axes[idx]) axes[idx].set_title(f'{col} by Target')Did You Know? The Art of Data Visualization
Section titled “Did You Know? The Art of Data Visualization”Edward Tufte’s principles (the godfather of data viz):
- Data-ink ratio: Maximize data, minimize chart junk
- Small multiples: Same plot repeated for different subsets
- Integrity: Don’t mislead with scales or cherry-picking
Color blindness: ~8% of men are colorblind. Use:
cmap='viridis'(perceptually uniform, colorblind-friendly)cmap='cividis'(optimized for colorblindness)- Or use different line styles/markers
The pie chart problem: Humans are bad at comparing angles. Use bar charts instead:
# BADplt.pie(values, labels=labels)
# GOODplt.barh(labels, values)Common Mistakes and How to Avoid Them
Section titled “Common Mistakes and How to Avoid Them”Mistake 1: The Silent Copy Problem
Section titled “Mistake 1: The Silent Copy Problem”One of pandas’ most confusing behaviors: sometimes operations return views, sometimes copies. This leads to SettingWithCopyWarning and silent bugs.
# WRONG - May not modify original DataFramedf_filtered = df[df['age'] > 30]df_filtered['status'] = 'senior' # Warning! Might not update df
# RIGHT - Explicitly chain .copy() or use .locdf_filtered = df[df['age'] > 30].copy()df_filtered['status'] = 'senior' # Now modifying an explicit copy
# RIGHT - Use .loc for in-place modificationdf.loc[df['age'] > 30, 'status'] = 'senior' # Directly modifies dfWhy this happens: pandas tries to be memory-efficient by returning views when possible. But the rules for when you get a view vs. copy are complex and have changed across versions.
Best practice: In pandas 2.0+, use copy_on_write mode:
pd.options.mode.copy_on_write = TrueMistake 2: Forgetting axis in NumPy
Section titled “Mistake 2: Forgetting axis in NumPy”NumPy’s axis parameter is counterintuitive for beginners:
arr = np.array([[1, 2, 3], [4, 5, 6]])
# CONFUSING: axis=0 sums along rows (collapses rows, keeps columns)arr.sum(axis=0) # array([5, 7, 9]) - sum of each column
# axis=1 sums along columns (collapses columns, keeps rows)arr.sum(axis=1) # array([6, 15]) - sum of each rowMemory trick: axis=0 operates across the first dimension (rows), collapsing them. axis=1 operates across the second dimension (columns), collapsing them. The axis you specify is the one that disappears.
Mistake 3: Using Python Loops for Vectorizable Operations
Section titled “Mistake 3: Using Python Loops for Vectorizable Operations”# WRONG - 100x slowerresult = []for i in range(len(df)): result.append(df.iloc[i]['price'] * df.iloc[i]['quantity'])df['total'] = result
# RIGHT - Vectorizeddf['total'] = df['price'] * df['quantity']
# WRONG - Using apply when vectorization worksdf['sqrt_price'] = df['price'].apply(lambda x: np.sqrt(x))
# RIGHT - Direct NumPy operationdf['sqrt_price'] = np.sqrt(df['price'])Rule of thumb: If you’re writing a loop over DataFrame rows, you’re probably doing it wrong. Look for vectorized alternatives first.
Mistake 4: Not Handling dtypes Properly
Section titled “Mistake 4: Not Handling dtypes Properly”# WRONG - Mixed types cause problemsdf = pd.DataFrame({'id': ['1', '2', '3'], 'value': ['10.5', '20.3', None]})df['value'].mean() # TypeError!
# RIGHT - Convert types explicitlydf['id'] = df['id'].astype(int)df['value'] = pd.to_numeric(df['value'], errors='coerce') # None → NaNdf['value'].mean() # Works: 15.4Always check dtypes first: df.dtypes should be your first command after loading data.
Mistake 5: Memory Explosion with String Columns
Section titled “Mistake 5: Memory Explosion with String Columns”# WRONG - Default string storage is memory-intensivedf = pd.read_csv('large_file.csv') # 10M rows with status column# 'status' has values: 'active', 'inactive', 'pending'# Memory: ~800 MB just for this column
# RIGHT - Use categorical dtypedf['status'] = df['status'].astype('category')# Memory: ~40 MB (20x reduction!)
# Even better - specify on readdf = pd.read_csv('large_file.csv', dtype={'status': 'category'})Production War Stories: pandas at Scale
Section titled “Production War Stories: pandas at Scale”The Data Scientist Who Crashed Production
Section titled “The Data Scientist Who Crashed Production”A data scientist at a fintech company wrote a feature engineering pipeline that worked perfectly in development. On 100,000 rows, it ran in 3 seconds.
In production, with 50 million rows, the system crashed. Not slow—crashed. The server ran out of memory.
The culprit? A single line: df.apply(custom_function, axis=1). The function created intermediate strings, and pandas kept them all in memory. 50 million rows × 200 bytes per string = 10 GB memory spike.
The fix: Vectorization.
# BEFORE: apply() creating millions of temporary objectsdef calculate_risk(row): if row['amount'] > 10000: return f"HIGH:{row['category']}" else: return f"LOW:{row['category']}"
df['risk'] = df.apply(calculate_risk, axis=1) # 10 GB memory, 45 minutes
# AFTER: Vectorized with np.wheredf['risk'] = np.where( df['amount'] > 10000, 'HIGH:' + df['category'].astype(str), 'LOW:' + df['category'].astype(str)) # 200 MB memory, 3 secondsThe Billion-Row Challenge
Section titled “The Billion-Row Challenge”Netflix’s data team needed to process viewer engagement data—1.5 billion rows of watch events. Traditional pandas couldn’t handle it (it would need 200+ GB RAM).
Their solution: chunked processing with pandas.
def process_large_file(filename, chunksize=1_000_000): results = []
for chunk in pd.read_csv(filename, chunksize=chunksize): # Process each million-row chunk chunk_result = chunk.groupby('user_id').agg({ 'watch_time': 'sum', 'title_id': 'nunique' }) results.append(chunk_result)
# Combine chunk results final = pd.concat(results).groupby(level=0).sum() return finalFor truly massive data, they moved to Dask—pandas-like syntax that automatically parallelizes across CPU cores and machines.
Why Google Rewrote TensorFlow’s Data Pipeline
Section titled “Why Google Rewrote TensorFlow’s Data Pipeline”TensorFlow 1.x had a data loading bottleneck. The GPU could train on a batch in 10ms, but loading the next batch from disk took 50ms. The GPU sat idle 80% of the time.
Google’s fix: tf.data, a pipeline that prefetches data while the GPU computes. The secret sauce? It uses the same memory layout principles as NumPy—contiguous arrays that can be transferred to GPU memory efficiently.
** Did You Know?**
The largest pandas DataFrame ever created (that we know of) was at Jane Street Capital—a quantitative trading firm. Their tick-by-tick market data DataFrame contained 4.2 billion rows representing every trade on US exchanges for a year. They used chunked loading, memory-mapped files, and 2TB of RAM. Processing a single backtest query against this DataFrame took 12 minutes. After converting to Apache Arrow format with Polars (a Rust-based DataFrame library), the same query took 8 seconds.
Putting It All Together: The ML Data Pipeline
Section titled “Putting It All Together: The ML Data Pipeline”Here’s how these tools work together in a real ML workflow:
import numpy as npimport pandas as pdimport matplotlib.pyplot as pltimport seaborn as snsfrom sklearn.model_selection import train_test_splitfrom sklearn.preprocessing import StandardScaler
# 1. Load datadf = pd.read_csv('dataset.csv')
# 2. Exploreprint(df.info())print(df.describe())print(df.isnull().sum())
# 3. Visualizefig, axes = plt.subplots(2, 2, figsize=(12, 10))sns.histplot(df['target'], ax=axes[0, 0])sns.heatmap(df.corr(), ax=axes[0, 1], cmap='coolwarm')sns.boxplot(x='category', y='feature', data=df, ax=axes[1, 0])sns.scatterplot(x='feature1', y='feature2', hue='target', data=df, ax=axes[1, 1])plt.tight_layout()
# 4. Cleandf = df.dropna(subset=['important_column'])df = df.fillna(df.median())df = df.drop_duplicates()
# 5. Transformdf['log_feature'] = np.log1p(df['skewed_feature'])df = pd.get_dummies(df, columns=['category'])
# 6. SplitX = df.drop('target', axis=1)y = df['target']X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# 7. Scale (using NumPy-backed transformations)scaler = StandardScaler()X_train_scaled = scaler.fit_transform(X_train)X_test_scaled = scaler.transform(X_test)
# Now ready for ML!Hands-On Practical Exercises
Section titled “Hands-On Practical Exercises”Exercise 1: NumPy Fundamentals
Section titled “Exercise 1: NumPy Fundamentals”Goal: Master array operations and understand memory efficiency.
Step-by-Step Instructions:
- Create a 10x10 matrix of random integers (0-100):
import numpy as npnp.random.seed(42) # For reproducibilitymatrix = np.random.randint(0, 100, size=(10, 10))print("Original matrix shape:", matrix.shape)print(matrix)- Find the mean of each row and column:
row_means = matrix.mean(axis=1) # Mean across columns (for each row)col_means = matrix.mean(axis=0) # Mean across rows (for each column)print(f"Row means: {row_means}")print(f"Column means: {col_means}")- Replace all values > 50 with 50 (clipping):
# Using np.clip is the most efficientclipped = np.clip(matrix, 0, 50)
# Alternative using boolean indexingmatrix_copy = matrix.copy()matrix_copy[matrix_copy > 50] = 50- Find the indices of the 5 largest values:
# Flatten, sort, and get indicesflat_indices = np.argsort(matrix.ravel())[-5:][::-1]# Convert to 2D indicesrow_indices, col_indices = np.unravel_index(flat_indices, matrix.shape)print("Top 5 values:", matrix[row_indices, col_indices])print("Locations:", list(zip(row_indices, col_indices)))- Multiply the matrix by its transpose and calculate determinant:
result = matrix @ matrix.T # Matrix multiplicationprint("Result shape:", result.shape) # Should be 10x10
# Determinant (works only for square matrices)det = np.linalg.det(result)print(f"Determinant: {det:.2e}") # Often very large!Expected Results: You should see the determinant is extremely large (or small) because matrix multiplication amplifies eigenvalues.
Success Criteria: Complete all 5 operations without using Python loops. Each operation should be a single line of NumPy code.
Exercise 2: pandas Data Analysis
Section titled “Exercise 2: pandas Data Analysis”Goal: Build a complete data analysis pipeline from raw data to insights.
Step-by-Step Instructions:
- Load the Titanic dataset:
import pandas as pdimport seaborn as sns
# Load from seaborn's built-in datasetsdf = sns.load_dataset('titanic')print(df.shape)print(df.info())- Show basic statistics and identify issues:
# Numerical summaryprint(df.describe())
# Missing valuesmissing = df.isnull().sum()missing_pct = (missing / len(df) * 100).round(1)print(pd.DataFrame({'missing': missing, 'percent': missing_pct}))- Handle missing values strategically:
# Age: Fill with median (age distributions are often skewed)df['age'] = df['age'].fillna(df['age'].median())
# Embarked: Fill with mode (most common value)df['embarked'] = df['embarked'].fillna(df['embarked'].mode()[0])
# Deck: Too many missing (77%+), might drop or create 'Unknown'df['deck'] = df['deck'].fillna('Unknown')- Group by and aggregate:
# Survival rate by class and sexsurvival_analysis = df.groupby(['pclass', 'sex']).agg({ 'survived': ['mean', 'sum', 'count']}).round(3)print(survival_analysis)
# Key insight: Women in 1st class survived 97% of the time- Create new features:
# Family sizedf['family_size'] = df['sibsp'] + df['parch'] + 1
# Is alonedf['is_alone'] = (df['family_size'] == 1).astype(int)
# Age categoriesdf['age_group'] = pd.cut(df['age'], bins=[0, 12, 20, 40, 60, 100], labels=['Child', 'Teen', 'Adult', 'Middle', 'Senior'])
# Fare per persondf['fare_per_person'] = df['fare'] / df['family_size']- Find correlations:
# Select only numeric columns for correlationnumeric_cols = df.select_dtypes(include=[np.number]).columnscorrelations = df[numeric_cols].corr()['survived'].sort_values(ascending=False)print("Correlations with survival:")print(correlations)Expected Results: You should find that fare and pclass are strongly correlated with survival (positive and negative respectively). Being alone (is_alone) slightly decreases survival chances.
Success Criteria: Complete the analysis and identify at least 3 meaningful insights about survival factors.
Exercise 3: Visualization Dashboard
Section titled “Exercise 3: Visualization Dashboard”Goal: Create a publication-quality visualization dashboard.
Step-by-Step Instructions:
import matplotlib.pyplot as pltimport seaborn as snsimport numpy as np
# Set stylesns.set_theme(style='whitegrid', palette='deep')fig = plt.figure(figsize=(16, 12))
# 1. Target distributionax1 = fig.add_subplot(2, 3, 1)df['survived'].value_counts().plot(kind='bar', ax=ax1, color=['#ff6b6b', '#4ecdc4'])ax1.set_title('Survival Distribution')ax1.set_xticklabels(['Died', 'Survived'], rotation=0)ax1.set_ylabel('Count')
# 2. Correlation heatmapax2 = fig.add_subplot(2, 3, 2)numeric_df = df.select_dtypes(include=[np.number])corr = numeric_df.corr()mask = np.triu(np.ones_like(corr, dtype=bool))sns.heatmap(corr, mask=mask, annot=True, fmt='.2f', cmap='coolwarm', center=0, ax=ax2, square=True)ax2.set_title('Feature Correlations')
# 3. Box plots by survivalax3 = fig.add_subplot(2, 3, 3)df.boxplot(column='age', by='survived', ax=ax3)ax3.set_title('Age Distribution by Survival')ax3.set_xlabel('Survived')plt.suptitle('') # Remove automatic title
# 4. Scatter plot of correlated featuresax4 = fig.add_subplot(2, 3, 4)sns.scatterplot(x='fare', y='age', hue='survived', data=df, alpha=0.6, ax=ax4)ax4.set_title('Fare vs Age (colored by survival)')
# 5. Bar chart of survival by classax5 = fig.add_subplot(2, 3, 5)survival_by_class = df.groupby('pclass')['survived'].mean()survival_by_class.plot(kind='bar', ax=ax5, color='steelblue')ax5.set_title('Survival Rate by Class')ax5.set_ylabel('Survival Rate')ax5.set_xticklabels(['1st', '2nd', '3rd'], rotation=0)
# 6. Pair plot (simplified version for subplot)ax6 = fig.add_subplot(2, 3, 6)sns.histplot(data=df, x='fare', hue='survived', bins=30, ax=ax6, kde=True)ax6.set_title('Fare Distribution by Survival')ax6.set_xlim(0, 200) # Cap for readability
plt.tight_layout()plt.savefig('titanic_dashboard.png', dpi=150, bbox_inches='tight')plt.show()Expected Results: A clean 2x3 dashboard showing survival patterns across different features.
Success Criteria: All 6 plots should be properly labeled, use consistent colors, and tell a coherent story about Titanic survival factors.
Deliverables
Section titled “Deliverables”For this module, you will build:
1. NumPy Performance Benchmark
Section titled “1. NumPy Performance Benchmark”- Compare NumPy vs Python loops for common operations
- Measure impact of vectorization
- Document speedup factors
2. Data Analysis Pipeline
Section titled “2. Data Analysis Pipeline”- Load, clean, and transform a dataset
- Feature engineering examples
- Export processed data
3. ML Data Toolkit (Main Deliverable)
Section titled “3. ML Data Toolkit (Main Deliverable)”- Reusable functions for:
- Data loading and exploration
- Cleaning and preprocessing
- Feature engineering
- Train/test splitting
- Visualization generation
- CLI interface for quick analysis
- JSON configuration support
The Future of the Python ML Ecosystem
Section titled “The Future of the Python ML Ecosystem”The Polars Revolution
Section titled “The Polars Revolution”In 2020, Ritchie Vink, a Dutch data engineer, started writing a DataFrame library in Rust. He called it Polars, and it’s shaking up the pandas world.
Polars is:
- 10-100x faster than pandas for many operations
- Memory efficient: Uses Apache Arrow columnar format
- Lazy evaluation: Optimizes query plans before execution
- Multi-threaded by default: Uses all CPU cores automatically
import polars as pl
# Polars syntax is similar to pandas but with lazy evaluationdf = pl.scan_csv('large_file.csv') # Doesn't read yetresult = ( df.filter(pl.col('amount') > 1000) .groupby('category') .agg([ pl.col('amount').sum().alias('total'), pl.col('amount').mean().alias('avg') ]) .collect() # NOW it reads and processes)Should you switch? For new projects with big data, strongly consider Polars. For existing pandas codebases, the migration cost may not be worth it. The pandas team is actively adding performance improvements (pandas 2.0 with Arrow backend closes much of the gap).
NumPy 2.0: The Future
Section titled “NumPy 2.0: The Future”NumPy 2.0 (released June 2024) brought major changes:
- String dtype: Proper variable-length strings (no more fixed-width object arrays)
- Copy semantics: More predictable when views vs copies are returned
- NEP 50: Promotion rules that match Python scalar behavior
- Removed deprecated features: Breaking changes for cleaner API
Most importantly, NumPy 2.0 maintains backward compatibility for well-written code. If your code breaks, it was probably relying on undocumented behavior.
JAX: NumPy for GPUs and TPUs
Section titled “JAX: NumPy for GPUs and TPUs”Google’s JAX is “NumPy that runs on GPUs”:
import jax.numpy as jnp
# This looks like NumPy...def neural_network(weights, x): return jnp.tanh(jnp.dot(weights, x))
# But runs on GPU and can auto-differentiate!from jax import gradgradient_fn = grad(neural_network)JAX is increasingly used in cutting-edge ML research because:
- Same NumPy API developers already know
- Automatic differentiation (no manual gradient code)
- JIT compilation to XLA (extreme speed)
- Easy GPU/TPU parallelization
The RAPIDS Ecosystem
Section titled “The RAPIDS Ecosystem”NVIDIA’s RAPIDS project brings pandas/NumPy APIs to GPUs:
import cudf # GPU DataFrameimport cupy # GPU NumPy
# Your pandas code... but on GPUgdf = cudf.read_csv('huge_file.csv') # Loaded to GPU memoryresult = gdf.groupby('category')['value'].mean() # Computed on GPUFor datasets that fit in GPU memory, RAPIDS can be 50-100x faster. The catch: you need an NVIDIA GPU, and not all pandas features are supported.
The Data Science Stack in 2025
Section titled “The Data Science Stack in 2025”Here’s what the modern Python ML stack looks like:
┌─────────────────────────────────────────────────────────────┐│ Your ML Code │├─────────────────────────────────────────────────────────────┤│ High-Level ML │ scikit-learn, XGBoost, LightGBM │├─────────────────────────────────────────────────────────────┤│ Deep Learning │ PyTorch, TensorFlow, JAX │├─────────────────────────────────────────────────────────────┤│ DataFrames │ pandas (CPU) / Polars / cuDF (GPU) │├─────────────────────────────────────────────────────────────┤│ Arrays │ NumPy (CPU) / CuPy (GPU) / JAX │├─────────────────────────────────────────────────────────────┤│ Storage Format │ Apache Arrow / Parquet │├─────────────────────────────────────────────────────────────┤│ Low-Level Compute │ BLAS/LAPACK (CPU) / cuBLAS (GPU) │└─────────────────────────────────────────────────────────────┘The key insight: Apache Arrow is becoming the universal interchange format. Polars, pandas 2.0, RAPIDS, and Spark all use Arrow internally, meaning you can pass data between them without copying.
** Did You Know?**
The transition from pandas 1.x to 2.0 took four years of development. The main challenge wasn’t adding new features—it was maintaining backward compatibility with the millions of lines of pandas code running in production worldwide. The pandas team estimated that breaking backward compatibility would affect $50 billion worth of financial models running on Wall Street alone. They chose to make 2.0 largely compatible, disappointing some who wanted a cleaner break.
Performance Benchmarks: Know Your Tools
Section titled “Performance Benchmarks: Know Your Tools”Understanding relative performance helps you choose the right tool:
Operation Speed Comparison (1 million rows)
Section titled “Operation Speed Comparison (1 million rows)”| Operation | Python Loop | NumPy | pandas | Polars |
|---|---|---|---|---|
| Sum column | 1500ms | 2ms | 3ms | 1ms |
| Filter rows | 2000ms | 15ms | 25ms | 5ms |
| Group by | N/A | N/A | 120ms | 20ms |
| Join tables | N/A | N/A | 200ms | 30ms |
| Sort | 8000ms | 80ms | 150ms | 40ms |
Memory Usage Comparison (1 million rows, 10 columns)
Section titled “Memory Usage Comparison (1 million rows, 10 columns)”| Format | Memory |
|---|---|
| Python list of dicts | ~800 MB |
| pandas (object dtype) | ~400 MB |
| pandas (optimized dtypes) | ~80 MB |
| Polars | ~60 MB |
| NumPy (float64) | ~80 MB |
When to Use What
Section titled “When to Use What”Use Python loops when:
- Processing < 1000 items
- Complex branching logic that can’t be vectorized
- One-time scripts where readability matters more than speed
Use NumPy when:
- Pure numerical computation
- Matrix/tensor operations
- Interfacing with C/Fortran libraries
- Building custom ML algorithms
Use pandas when:
- Exploratory data analysis
- Mixed data types (strings, dates, numbers)
- Need SQL-like operations (groupby, merge)
- Working with time series
Use Polars when:
- Large datasets (> 1 million rows)
- Performance-critical pipelines
- Building new projects without pandas legacy code
- Need parallel processing
Summary: Your ML Toolkit
Section titled “Summary: Your ML Toolkit”You’ve now mastered the three pillars of Python’s ML ecosystem:
NumPy is the foundation—understand arrays, vectorization, and memory layout. Every other library builds on these concepts.
pandas is your data wrangling workhorse—clean, transform, and explore data before feeding it to models. Learn to avoid loops and embrace vectorization.
matplotlib/seaborn are your visualization tools—always plot your data. The human eye catches patterns that statistical tests miss.
Key Takeaways:
-
Vectorization is everything: Replace Python loops with NumPy/pandas operations. The speedup is 100-1000x.
-
Memory layout matters: Contiguous arrays (NumPy) beat scattered objects (Python lists). Understanding this explains half of ML performance issues.
-
Know your dtypes: Use appropriate data types (int32 vs int64, category vs object). Memory and speed improvements are dramatic.
-
Profile before optimizing: The bottleneck is rarely where you think. Use
%timeitand memory profilers. -
The ecosystem is evolving: Polars, JAX, and Arrow are reshaping the landscape. Stay curious about new tools, but master the fundamentals first.
With these tools, you’re ready to build neural networks from scratch in Module 26. The NumPy operations you’ve learned—matrix multiplication, broadcasting, vectorization—are exactly what neural networks need.
Further Reading
Section titled “Further Reading”- NumPy User Guide
- From Python to NumPy - Free book by Nicolas Rougier
- 100 NumPy Exercises
pandas
Section titled “pandas”- pandas User Guide
- Python for Data Analysis - Free book by Wes McKinney
- Modern Pandas - Best practices
Visualization
Section titled “Visualization”- matplotlib Tutorials
- seaborn Tutorial
- The Visual Display of Quantitative Information - Tufte’s classic
Scientific Python
Section titled “Scientific Python”Interview Prep: What Employers Want to See
Section titled “Interview Prep: What Employers Want to See”If you’re preparing for data science or ML engineering interviews, NumPy and pandas proficiency is non-negotiable. Here’s what interviewers look for:
Common Technical Questions
Section titled “Common Technical Questions”NumPy Questions:
- “How would you normalize a 2D array?”
- Answer:
(arr - arr.mean(axis=0)) / arr.std(axis=0)for column-wise normalization
- Answer:
- “What’s the difference between
.reshape()and.resize()?”- Answer:
reshapereturns a view (if possible) without modifying data;resizemodifies in-place and can change size
- Answer:
- “How do you find the most common value in an array?”
- Answer:
np.bincount(arr).argmax()for integers, ornp.unique(arr, return_counts=True)for general cases
- Answer:
pandas Questions:
- “How do you efficiently iterate over a DataFrame?”
- Answer: “I don’t. I use vectorized operations. If I must iterate, I use
.itertuples()not.iterrows()—it’s 100x faster.”
- Answer: “I don’t. I use vectorized operations. If I must iterate, I use
- “What’s the difference between
.locand.iloc?”- Answer:
.locis label-based;.ilocis integer position-based. They have different slicing behavior (.locis inclusive on both ends).
- Answer:
- “How would you handle a 50GB CSV file?”
- Answer: “Chunked reading with
pd.read_csv(chunksize=...), selecting only needed columns withusecols=, or switching to Polars/Dask for out-of-core processing.”
- Answer: “Chunked reading with
Live Coding Scenarios
Section titled “Live Coding Scenarios”Interviewers often give you a dataset and ask questions. Practice these patterns:
Data Cleaning Pipeline:
def clean_data(df): # 1. Check shape and types print(f"Shape: {df.shape}") print(f"Missing: {df.isnull().sum().sum()} values")
# 2. Handle missing values numeric_cols = df.select_dtypes(include=[np.number]).columns df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())
# 3. Remove duplicates df = df.drop_duplicates()
# 4. Fix data types for col in df.select_dtypes(include=['object']).columns: if df[col].nunique() < 20: df[col] = df[col].astype('category')
return dfFeature Engineering:
def create_features(df): # Time-based features (if datetime column exists) if 'date' in df.columns: df['date'] = pd.to_datetime(df['date']) df['day_of_week'] = df['date'].dt.dayofweek df['month'] = df['date'].dt.month df['is_weekend'] = df['day_of_week'].isin([5, 6]).astype(int)
# Aggregation features if 'category' in df.columns and 'value' in df.columns: agg = df.groupby('category')['value'].transform('mean') df['value_vs_category_mean'] = df['value'] - agg
# Binning continuous features if 'age' in df.columns: df['age_group'] = pd.cut(df['age'], bins=[0, 18, 35, 55, 100], labels=['youth', 'young_adult', 'middle', 'senior'])
return dfRed Flags That Fail Interviews
Section titled “Red Flags That Fail Interviews”- Using loops when vectorization is possible: Shows you don’t understand pandas/NumPy
- Not checking for missing values first: Basic data hygiene
- Using
.apply()for simple operations: Shows lack of knowledge about vectorization - Ignoring data types: Memory and correctness issues
- Not exploring data before modeling: Jumping to models without understanding data
Green Flags That Impress
Section titled “Green Flags That Impress”- Knowing when to use
.loc[]vs.iloc[]vs boolean indexing - Understanding memory implications (category vs object, int32 vs int64)
- Using
df.info()anddf.describe()as first steps - Asking about data distribution before choosing fill strategies
- Mentioning performance alternatives (Polars, Dask) for large data
Did You Know? Fun Facts
Section titled “Did You Know? Fun Facts”The Billion-Dollar Bug
Section titled “The Billion-Dollar Bug”In 2012, Knight Capital lost $440 million in 45 minutes due to a trading algorithm bug. Post-mortem analysis was done entirely in pandas. The ability to quickly analyze millions of trades led to regulatory changes requiring better data analysis practices.
NumPy in Space
Section titled “NumPy in Space”NASA’s James Webb Space Telescope image processing pipeline uses NumPy. The famous first images required processing petabytes of data through NumPy arrays. When you see those stunning space images, you’re seeing NumPy at work.
pandas Named After Econometrics
Section titled “pandas Named After Econometrics”Despite what you might think, “pandas” isn’t named after the animal. It comes from “Panel Data” - a term from econometrics for multi-dimensional data structures. Wes McKinney was an economist before a programmer!
matplotlib’s Easter Eggs
Section titled “matplotlib’s Easter Eggs”matplotlib has hidden XKCD-style plotting:
with plt.xkcd(): plt.plot([1, 2, 3], [1, 4, 9]) plt.title('Much professional, very science')The Hadley Wickham Effect
Section titled “The Hadley Wickham Effect”Hadley Wickham created R’s ggplot2 and tidyverse. His influence on data science was so strong that Python libraries started copying his approach. seaborn’s “grammar of graphics” is directly inspired by ggplot2.
The Secret NumPy Test at Google
Section titled “The Secret NumPy Test at Google”Rumor has it that Google’s ML interview process included a secret NumPy competency test. Candidates who used Python loops instead of vectorized operations for a simple matrix problem were immediately flagged. The reasoning? If you don’t understand vectorization, you don’t understand how neural networks actually compute—and you’ll write training code that’s 100x slower than necessary.
While Google has likely evolved their interview process, the principle remains: NumPy proficiency is a proxy for understanding computational thinking at scale.
Travis Oliphant’s Dual Legacy
Section titled “Travis Oliphant’s Dual Legacy”Travis Oliphant didn’t just create NumPy. In 2012, he founded Continuum Analytics (now Anaconda), which created:
- Anaconda distribution: The most popular Python distribution for data science
- conda: Package manager that handles dependencies better than pip
- Numba: JIT compiler that makes Python code run at C speed
His work touches virtually every data scientist’s daily workflow. When you type conda install or import numpy, you’re using his creations.
The Matplotlib Rebellion
Section titled “The Matplotlib Rebellion”In 2013, a group of frustrated matplotlib users started a project called seaborn. They wanted “high-level statistical visualization”—defaults that were beautiful instead of ugly, APIs that were intuitive instead of verbose.
The project’s creator, Michael Waskom, was a neuroscience PhD student at Stanford. He built seaborn because he was tired of writing 50 lines of matplotlib code for simple plots. His frustration became the community’s solution.
Today, seaborn’s success has pushed matplotlib to improve. The gap between them has narrowed, but seaborn remains the go-to for quick statistical plots.
The $10 Million Library
Section titled “The $10 Million Library”In 2018, Wes McKinney estimated that pandas generates approximately $10 million per year in economic value from time saved by data scientists. With an estimated 5 million pandas users, each saving perhaps 100 hours per year compared to manual data manipulation, and valuing their time at $50/hour… the math adds up.
And pandas is free. This is the magic of open source: billions of dollars in value, available to anyone who types import pandas.
The Fortran Connection You Didn’t Know About
Section titled “The Fortran Connection You Didn’t Know About”Here’s a mind-bending fact: when you call np.dot() for matrix multiplication, your Python code is ultimately calling Fortran subroutines written in the 1970s.
BLAS (Basic Linear Algebra Subprograms) was created in 1979 by Charles Lawson, Richard Hanson, David Kincaid, and Fred Krogh. LAPACK followed in 1992. These libraries are so meticulously optimized—hand-tuned assembly code for specific CPU architectures—that no one has been able to beat them in 40+ years of trying.
Modern implementations like Intel MKL and OpenBLAS are essentially the same algorithms, just adapted for modern CPUs. When you train a neural network, those gradient descent matrix multiplications are flowing through code older than most programmers using them.
This is why NumPy is fast: it’s not doing the heavy lifting. It’s delegating to libraries that have been optimized across four decades by some of the brightest numerical computing minds in history.
Every time you multiply matrices in Python, you’re standing on the shoulders of computational giants who spent their careers making these operations as fast as physically possible.
Next Steps
Section titled “Next Steps”With NumPy, pandas, and visualization mastered, you’re ready for Module 26: Neural Networks from Scratch.
You’ll use these exact tools to:
- Create weight matrices with NumPy
- Track training metrics in pandas
- Visualize the learning process with matplotlib
The foundation is laid. Now let’s build neural networks!
Module 25 Complete! You now have the tools for machine learning.