Neural Networks with TensorFlow and Keras, Step by Step (Part 15)

Part 14 trained models on features that humans chose: someone decided petal length was worth measuring. Deep learning removes that someone. A neural network ingests raw data (pixels, audio samples, text) and learns its own internal features layer by layer: edges become shapes, shapes become sleeves, and sleeves become "this is a shirt". That ability is why the last decade of AI happened, and this lesson is your honest, no-hand-waving introduction: what a neuron actually computes, how a network learns from its mistakes, and a complete Keras program that learns to read clothing photos at around 88 percent accuracy in a couple of minutes.

Set expectations like a professional. One lesson will not make you a deep learning engineer, and anyone who promises that is selling something. What one lesson can genuinely deliver is the four-piece mental model (neurons, loss, gradient descent, layers) that makes every tutorial, paper abstract, and job posting in this field readable, plus working code you ran and modified yourself. That foundation is real, it is durable, and it is exactly what Part 16 builds on when the networks grow into language models.


What you will learn in Part 15

What a single neuron computes: weights, bias, activation
Loss: a single number measuring how wrong the network is
Gradient descent: learning as thousands of tiny corrections
Layers in Keras: building a network like stacking Lego
Training a real image classifier on Fashion MNIST, end to end
Overfitting in networks, dropout, and the transfer learning shortcut

⚠

Warning

About running today's code

TensorFlow does not run inside the browser playground, so this lesson splits honestly: the playground below builds a working neural network in pure NumPy, every gear visible, while the Keras programs are presented as code to run on your machine (pip install tensorflow) or, easiest of all, free on Google Colab in your browser with zero setup.

1. One neuron, no mystery

Strip away the brain metaphors and a neuron is a small calculation you could do on paper: multiply each input by a weight, add them up, add a bias, then pass the total through an activation function. The weights are the knowledge, one importance dial per input; the bias shifts the threshold; and the activation adds the crucial nonlinearity. The modern default activation, ReLU, simply clips negatives to zero. One neuron is nearly powerless. The power is social: layer neurons so each consumes the outputs of the previous layer, and the network composes simple judgments into sophisticated ones, edges into textures into objects.

import numpy as np

def relu(x):
    return np.maximum(0, x)

inputs  = np.array([0.5, 0.8, 0.2])      # three feature values
weights = np.array([0.9, -0.4, 0.3])     # the neuron's "knowledge"
bias    = 0.1

raw = np.dot(inputs, weights) + bias     # weighted sum
out = relu(raw)                          # activation
print(f"raw {raw:.3f} -> activated {out:.3f}")

A layer is just many neurons sharing the same inputs, which collapses into one matrix multiplication, the reason GPUs, built to multiply matrices for graphics, became the engines of AI. A network is layers feeding layers; data flowing through is the forward pass. Untrained, the weights are random and the outputs are noise. Everything now hangs on one question, the entire field in a sentence: how do the weights become right?
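To make the "one matrix multiplication" claim concrete, here is a minimal sketch of a layer of four neurons reading the same three inputs as the single neuron above. The weight and bias values are made up for illustration; the first column of W reuses that neuron's weights, and every other number is an assumption.

```python
import numpy as np

inputs = np.array([0.5, 0.8, 0.2])           # same three features as above

# Illustrative (made-up) weights: one column per neuron, 3 inputs x 4 neurons
W = np.array([[ 0.9,  0.1, -0.5,  0.2],
              [-0.4,  0.7,  0.3, -0.6],
              [ 0.3, -0.2,  0.8,  0.4]])
b = np.array([0.1, 0.0, -0.1, 0.2])          # one bias per neuron

# Neuron by neuron: four separate weighted sums
one_by_one = np.array([np.dot(inputs, W[:, j]) + b[j] for j in range(4)])

# The whole layer at once: a single matrix multiplication
layer_raw = inputs @ W + b
layer_out = np.maximum(0, layer_raw)         # ReLU applied to every neuron

print("raw      :", np.round(layer_raw, 3))
print("activated:", np.round(layer_out, 3))
print("same as neuron-by-neuron:", np.allclose(one_by_one, layer_raw))
```

The last neuron's raw total is negative, so ReLU clips it to zero, while the first neuron reproduces the 0.290 from the single-neuron example.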

2. Loss and gradient descent: learning from wrongness

First, measure the wrongness. A loss function compares the network's prediction with the true label and produces one number: zero for perfect, growing with error. For classification, the standard is cross-entropy, which punishes confident wrong answers most, exactly the incentive you want. Training never tries to be right directly; it tries to make the loss smaller, millions of times in a row.
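A small sketch shows what "punishes confident wrong answers most" means in numbers. For a single example, cross-entropy is minus the log of the probability the network gave to the true class. The three probability vectors below are invented for illustration, and the helper name cross_entropy is ours, not a library function.

```python
import numpy as np

def cross_entropy(probs, true_class):
    """Loss for one example: minus the log of the true class's probability."""
    return -np.log(probs[true_class])

true_class = 0                                  # the correct answer is class 0

# Made-up predictions over three classes
confident_right = np.array([0.90, 0.05, 0.05])
unsure          = np.array([0.40, 0.30, 0.30])
confident_wrong = np.array([0.01, 0.98, 0.01])

for name, probs in [("confident and right", confident_right),
                    ("unsure", unsure),
                    ("confident and wrong", confident_wrong)]:
    print(f"{name:20s} loss {cross_entropy(probs, true_class):.3f}")
```

The losses come out near 0.105, 0.916, and 4.605: being unsure costs a little, being confidently wrong costs a lot.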

Here is the idea to keep for life. For every weight in the network, calculus can answer one question cheaply: if this weight rose a hair, would the loss rise or fall? That answer for all weights at once is the gradient, computed by the backpropagation algorithm. The learning rule is then almost embarrassingly simple: nudge every weight a small step in the direction that lowers the loss, where the step size is the learning rate. Show a batch of examples, compute loss, backpropagate, nudge. One pass over the whole dataset is an epoch, and training repeats for several epochs. This descent down the loss landscape by tiny steps is gradient descent, and it is the entire engine; the playground below lets you watch it happen with your own eyes.
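Before a whole network, the same rule can be sketched with a single weight. The toy loss below, (w - 3) squared, is an assumption chosen because its minimum is obvious; a real network's loss depends on all its weights and the data, but the update line is the same idea.

```python
def loss(w):
    return (w - 3.0) ** 2          # toy loss: zero when w is exactly 3

def gradient(w):
    return 2.0 * (w - 3.0)         # slope of the loss at w

w = 0.0                            # start far from the answer
learning_rate = 0.1
losses = []

for step in range(25):
    w -= learning_rate * gradient(w)   # nudge w a small step downhill
    losses.append(loss(w))
    if step % 5 == 0:
        print(f"step {step:2d}  w {w:.3f}  loss {losses[-1]:.4f}")

print(f"final w {w:.3f}")
```

Try learning_rate = 1.1 afterwards: each step then overshoots the minimum by more than it corrects, and the loss grows instead of shrinking.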

Checkpoint

During training, what does the network actually adjust?

3. Keras: the network as Lego

Writing backpropagation by hand is a rite of passage nobody repeats at work; Keras, the friendly face of TensorFlow, reduces a network to declaring its layers. Our task is Fashion MNIST, the modern hello world of vision and the same dataset behind the Fashion MNIST Keras Classifier mini project in the Learn Python app: 70,000 grayscale photos, 28 by 28 pixels, of clothing in ten classes. The network below flattens each image into 784 pixel values, passes them through a hidden ReLU layer of 128 neurons, applies dropout, and ends with 10 softmax outputs, one probability per class.

# Run on your machine or free on Google Colab
import tensorflow as tf
from tensorflow import keras

(X_train, y_train), (X_test, y_test) = \
    keras.datasets.fashion_mnist.load_data()

X_train = X_train / 255.0        # scale pixels to 0..1
X_test  = X_test / 255.0

model = keras.Sequential([
    keras.Input(shape=(28, 28)),                    # one 28x28 image
    keras.layers.Flatten(),                         # 784 inputs
    keras.layers.Dense(128, activation="relu"),     # hidden layer
    keras.layers.Dropout(0.2),                      # anti-overfitting
    keras.layers.Dense(10, activation="softmax"),   # 10 class scores
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()                   # ~101,770 learnable weights
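The weight count in that last comment can be checked by hand. A Dense layer holds one weight per input per neuron, plus one bias per neuron; Flatten and Dropout hold none. The helper name dense_params below is ours, used only for this arithmetic.

```python
def dense_params(n_inputs, n_neurons):
    """Weights plus biases in one fully connected (Dense) layer."""
    return n_inputs * n_neurons + n_neurons

hidden = dense_params(28 * 28, 128)   # 784 pixels into 128 neurons
output = dense_params(128, 10)        # 128 hidden values into 10 classes
total = hidden + output

print(f"hidden layer: {hidden:,}")    # 100,480
print(f"output layer: {output:,}")    # 1,290
print(f"total       : {total:,}")     # 101,770
```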

Every line maps to sections 1 and 2: Dense layers are the matrix multiplications, relu and softmax the activations, the loss is cross-entropy, and adam is gradient descent with quality-of-life improvements. The summary line is worth a pause: about a hundred thousand weights, each about to be nudged thousands of times. Training and grading reuse the exact discipline of Part 14: fit on training data, evaluate on the hidden test set. Same words, same reasons.

history = model.fit(X_train, y_train,
                    epochs=5, batch_size=32,
                    validation_split=0.1)

test_loss, test_acc = model.evaluate(X_test, y_test)
print(f"test accuracy: {test_acc:.2%}")        # ~88% in ~2 minutes

import numpy as np
probs = model.predict(X_test[:1])
print("class probabilities:", np.round(probs[0], 2))
print("best guess:", probs[0].argmax())
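The class probabilities printed above come from the softmax activation on the final layer, which turns ten raw scores into positive numbers that sum to one. Here is a minimal NumPy sketch of that calculation; the four scores are made up, and subtracting the maximum first is a common numerical-stability habit rather than anything specific to this model.

```python
import numpy as np

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    shifted = scores - np.max(scores)     # avoids huge exponentials
    exps = np.exp(shifted)
    return exps / exps.sum()

scores = np.array([2.0, 1.0, 0.1, -1.0])  # made-up raw outputs for 4 classes
probs = softmax(scores)

print("probabilities:", np.round(probs, 3))
print("sum:", round(float(probs.sum()), 6))
print("best guess:", probs.argmax())
```

The largest score always gets the largest probability, so argmax on the probabilities picks the same class as argmax on the raw scores.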

Watch the progress log as it trains: training accuracy climbs each epoch, and the validation accuracy, measured on a held-out slice, climbs with it until it plateaus. If you train for fifty epochs, you will likely watch validation accuracy stall and then sink while training accuracy keeps rising: Part 14's overfitting gap drawn live on your screen. Dropout, which randomly silences neurons during training so none can become a load-bearing memorizer, delays that divergence; early stopping, simply quitting once validation accuracy stops improving, is the other everyday cure.
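The early stopping rule is simple enough to sketch in plain Python. The function below is a hypothetical helper, not part of Keras, and the validation accuracies are invented to mimic the rise-then-sink shape described above. Keras ships a ready-made callback, keras.callbacks.EarlyStopping, built around the same patience idea.

```python
def early_stop_epoch(val_scores, patience=2):
    """Return (best_epoch, stop_epoch), both counted from 1.

    Training stops once `patience` epochs in a row fail to beat the best score.
    """
    best_score = float("-inf")
    best_epoch = 0
    for epoch, score in enumerate(val_scores, start=1):
        if score > best_score:
            best_score, best_epoch = score, epoch     # new best: keep going
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch                  # patience used up
    return best_epoch, len(val_scores)                # never triggered

# Made-up validation accuracies, for illustration only
val_acc = [0.840, 0.860, 0.875, 0.880, 0.879, 0.878, 0.874]

best, stopped = early_stop_epoch(val_acc, patience=2)
print(f"best epoch {best}, stopped after epoch {stopped}")
```

With these numbers the best epoch is 4 and training stops after epoch 6, two epochs without improvement. A larger patience tolerates more wobble before giving up.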

Three vocabulary words from that fit call deserve precise meanings, because every tutorial you read next will assume them. The batch_size of 32 means weights update after each group of 32 examples, a compromise between noisy single-example updates and slow full-dataset ones. An epoch is one complete pass over the training data, so five epochs showed the network every image five times. And validation_split=0.1 held out a tenth of training data to grade each epoch, the canary that detects overfitting while training is still running, distinct from the test set, which stays sealed until the very end.
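Those three settings also fix how many weight updates the fit call performs, which is quick to work out. The sketch assumes the standard Fashion MNIST split of 60,000 training and 10,000 test images, and that the final partial batch still counts as one update.

```python
import math

n_train_images = 60_000        # assumed standard Fashion MNIST training set size
batch_size = 32
epochs = 5

n_validation = n_train_images // 10              # validation_split=0.1
n_fit = n_train_images - n_validation            # images the weights learn from

steps_per_epoch = math.ceil(n_fit / batch_size)  # last batch may be smaller
total_updates = steps_per_epoch * epochs

print(f"images used for fitting: {n_fit:,}")     # 54,000
print(f"updates per epoch: {steps_per_epoch:,}") # 1,688
print(f"updates in total: {total_updates:,}")    # 8,440
```

The per-epoch figure is the step counter you should see ticking up in the Keras progress bar.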

Checkpoint

Training accuracy keeps rising across epochs while validation accuracy peaks and then declines. The network is...

4. Watch gradient descent with your own eyes

The playground below trains a genuine neural network (two inputs, a hidden ReLU layer, one sigmoid output) on the XOR problem, the tiny dataset that famously cannot be solved without a hidden layer, using nothing but NumPy. Forward pass, loss, backpropagation, weight nudges: every gear from sections 1 and 2, in forty lines, running live. Watch the loss fall, then do the exercises; cranking the learning rate too high and watching training explode is a rite of passage best experienced where it costs nothing.

Python playground

import numpy as np

rng = np.random.default_rng(42)
X = np.array([[0,0],[0,1],[1,0],[1,1]], dtype=float)
y = np.array([[0],[1],[1],[0]], dtype=float)      # XOR

W1 = rng.normal(0, 1, (2, 8)); b1 = np.zeros((1, 8))
W2 = rng.normal(0, 1, (8, 1)); b2 = np.zeros((1, 1))
lr = 0.5

def sigmoid(z): return 1 / (1 + np.exp(-z))

for step in range(3001):
    # forward pass
    h = np.maximum(0, X @ W1 + b1)            # hidden ReLU layer
    out = sigmoid(h @ W2 + b2)                # output probability
    loss = np.mean((out - y) ** 2)            # mean squared error

    # backward pass (backpropagation, by hand)
    d_out = 2 * (out - y) * out * (1 - out)   # loss slope through the sigmoid
    d_W2 = h.T @ d_out
    d_h = (d_out @ W2.T) * (h > 0)            # ReLU gate
    d_W1 = X.T @ d_h

    # gradient descent: nudge weights downhill
    W2 -= lr * d_W2;  b2 -= lr * d_out.sum(0)
    W1 -= lr * d_W1;  b1 -= lr * d_h.sum(0)

    # report progress every 500 steps (must sit inside the loop)
    if step % 500 == 0:
        print(f"step {step:4}  loss {loss:.4f}")

print("\npredictions:", np.round(out.ravel(), 2), " target:", y.ravel())

# Exercises:
# 1. Set lr = 5.0 and watch training explode; then 0.01 and watch it crawl.
# 2. Shrink the hidden layer to 1 neuron. Can it still learn XOR? Why not?
# 3. Change y to AND ([0,0,0,1]) and confirm it trains even easier.

5. Standing on giant shoulders: transfer learning

One more idea completes a beginner's map, and it is the most practical of all. Training large networks from scratch needs data and compute most people lack, but networks trained by those who have both are downloadable, and their learned features transfer. Take a model pretrained on millions of photos, chop off its final layer, bolt on a fresh one for your classes, and train only that: suddenly a thousand of your own photos yields a strong classifier in minutes on a free Colab GPU. The Transfer Learning Image Classifier mini project in the Learn Python app walks this exact recipe. Hold onto the concept firmly, because Part 16 reveals modern AI's open secret: the entire large language model era is transfer learning at planetary scale.

Classic ML or deep learning? The honest defaults
Situation	Reach for	Why
Tabular rows and columns (Part 14 territory)	scikit-learn	Trees and ensembles usually win on tables, train in seconds, and explain themselves
Images, audio, free text	Deep learning	Learned features beat hand-made ones decisively on raw perception data
Little data, big ambition	Transfer learning	Pretrained features let small datasets punch far above their weight
Need to explain every decision	Classic ML	A printed tree is auditable; a hundred thousand weights are not
Language understanding and generation	Pretrained LLMs (Part 16)	Training from scratch is industrial; using pretrained models is an API call

! Common mistakes to avoid

✕ Forgetting to scale pixel values before training.

✓ Networks train poorly on raw 0-255 inputs. Divide by 255 so inputs sit in 0..1; preprocessing is part of the model, not an optional nicety.

✕ Judging the model by training accuracy or the last epoch's loss.

✓ Only held-out data counts, same law as Part 14. Watch validation metrics during training and report test metrics once, at the end.

✕ Cranking epochs until training accuracy hits 100%.

✓ That is manufacturing overfitting. Stop where validation accuracy peaks; going further only teaches the network its worksheet.

✕ Starting your learning journey by training huge models from scratch.

✓ Small datasets, small networks, and transfer learning are where understanding grows. Scale is a tool you earn after the fundamentals, not a substitute for them.

? Frequently asked questions

Do I need a GPU to follow along?

Not today. Fashion MNIST trains on a laptop CPU in a couple of minutes, and Google Colab gives free GPU notebooks in the browser when you outgrow it. Buy hardware only after free options actually limit you.

TensorFlow or PyTorch?

Both are excellent, industrial-grade frameworks; PyTorch dominates research, TensorFlow/Keras remains beloved for approachability and deployment. Every concept in this lesson transfers between them almost line for line, so the honest answer is: whichever your next tutorial uses.

Why did my training produce slightly different numbers than yours?

Weights initialize randomly and data shuffles, so runs vary a little; that is normal. Seeds tame it for experiments, as random_state did in Part 14.
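A quick NumPy sketch shows what a seed buys you. The helper init_weights is hypothetical and only mimics the random weight setup from the playground; Keras has its own seeding helper, keras.utils.set_random_seed, for the same purpose.

```python
import numpy as np

def init_weights(seed=None):
    """Draw a small random weight matrix; a fixed seed makes it repeatable."""
    rng = np.random.default_rng(seed)
    return rng.normal(0, 1, (2, 3))

a = init_weights(seed=42)
b = init_weights(seed=42)     # same seed: identical starting weights
c = init_weights()            # no seed: fresh randomness on every run

print("same seed gives same weights:", np.array_equal(a, b))
print("no seed gives same weights:  ", np.array_equal(a, c))
```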

What math should I learn if this hooked me?

In payoff order: linear algebra (matrices are everything here), then calculus through the chain rule (backpropagation is the chain rule, industrialized), then probability. Learn each just-in-time, attached to code like today's playground, and it sticks.

6. Recap and what comes next

You now own the four-piece mental model that decodes this entire field: neurons computing weighted sums through activations, loss measuring wrongness, gradient descent with backpropagation turning wrongness into millions of tiny corrections, and layers composing simple judgments into perception. You built the engine yourself in NumPy, ran a real Keras classifier to roughly 88 percent on clothing photos, learned to read the overfitting curves, and banked the transfer learning idea that the next lesson scales to planetary size.

The finale awaits: Part 16, a practical introduction to LLMs and RAG, where today's networks grow into models that read and write, you make your first real API call to one, and the series hands you a map of everything to build next. The Fashion MNIST and Transfer Learning mini projects in the Learn Python app below are ideal homework, and the full syllabus is on the series hub.

💡

Pro tip

Run the Keras example on Google Colab tonight, and after training, go hunting for a mistake on purpose: find a test image the model got wrong, display it, and look at the probability spread. Studying a model's confident mistakes teaches more intuition per minute than any lecture in this field.

Practice on the go

Learn Python, the free Android app

Every topic in this series lives in the app too: bite-size lessons, runnable examples, quizzes, mini projects, and an offline Python playground that runs on your phone.
