Published: Jun 13, 2026

Working with Text and Data in Python: Regex, JSON, and CSV

Almost every practical program is, at heart, a translator: data arrives in one shape and must leave in another. A log file becomes a report, a spreadsheet export becomes database rows, an API response becomes objects your code can use. This lesson teaches the three tools that handle the overwhelming majority of that work: regular expressions for finding structure inside free text, JSON for the format that web services and configuration files speak, and CSV for the format spreadsheets and data exports speak. Master these three and the phrase "can you process this file" stops being scary forever.

Fair warning about regular expressions, because honesty teaches better than cheerleading: regex is a compact pattern language with a real learning curve, and nobody remembers all of it. The professionals you admire know perhaps a dozen building blocks cold and look up the rest, every time, without shame. This lesson teaches exactly that core dozen, and the playground lets you test patterns live, which is genuinely the only way regex has ever been learned by anyone.

What you will learn in Part 11

  • The string methods that solve text problems without regex
  • Regex building blocks: classes, quantifiers, anchors, and groups
  • search, findall, and sub: the three re functions that matter
  • Reading and writing JSON with loads and dumps
  • CSV done right with DictReader and DictWriter
  • A complete pipeline: messy CSV in, clean JSON out

Note

Before you start

You need dictionaries and lists from Part 5, files from Part 7, and ideally the pipeline mindset of Part 9. The JSON and CSV lessons in the Learn Python app pair with this part.

1. First, the tools you already have

Reach for regex last, not first. Python strings carry a toolbox that solves most everyday text problems in one readable call: strip and friends remove whitespace, split and join convert between strings and lists, replace substitutes, startswith and endswith test edges, lower normalizes case, and the in operator finds substrings. The choice between string methods and regex is simple: if you are looking for literal text, string methods suffice; if you are looking for a shape of text, like any email or any date, that is regex territory.

line = "  Amina Perera <amina@example.com>  "

clean = line.strip()
print(clean.lower().startswith("amina"))   # True
print("@" in clean)                        # True

name_part = clean.split("<")[0].strip()
print(name_part)                           # Amina Perera
print("-".join(["2026", "06", "13"]))      # 2026-06-13

2. Regex: the twelve building blocks

A regular expression is a pattern that describes a family of strings. The vocabulary that covers daily work: a dot matches any character except a newline; \d, \w, and \s match a digit, a word character, and whitespace, with the capitals \D, \W, and \S meaning the opposite; square brackets define your own character class like [aeiou]; quantifiers say how many, with * for zero or more, + for one or more, ? for optional, and {n,m} for a counted range; anchors ^ and $ pin the pattern to the start and end; parentheses group and capture; and the pipe means or. Write patterns as raw strings, r"...", so backslashes survive untouched.

import re

text = "Order #4321 shipped on 2026-06-13 to amina@example.com"

print(re.search(r"\d{4}-\d{2}-\d{2}", text).group())   # 2026-06-13
print(re.findall(r"#\d+", text))                       # ['#4321']

m = re.search(r"(\w+)@([\w.]+)", text)                 # capture groups
print(m.group(1), "at", m.group(2))    # amina at example.com

masked = re.sub(r"\w+@[\w.]+", "[email hidden]", text)
print(masked)   # Order #4321 shipped on 2026-06-13 to [email hidden]

One subtlety separates finding from validating, and the EMAIL pattern in the playground below depends on it. re.search answers "does this shape appear anywhere in the text", which is right for extraction; validation asks "is this entire string the shape", which needs the anchors: ^ at the start and $ at the end, or better, the convenience function re.fullmatch, which implies both and is slightly stricter, because $ also matches just before a trailing newline. An email validator without anchors happily accepts "junk amina@example.com junk", a bug that has shipped to production more times than anyone will admit. Extracting: no anchors. Validating: anchors, always.
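The difference between finding and validating is easy to see side by side. The sketch below uses a hypothetical EMAIL pattern chosen for illustration (it is not the playground's pattern, and real-world email validation is messier than any short regex):

```python
import re

# Hypothetical pattern for illustration only.
EMAIL = r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"

samples = ["amina@example.com", "junk amina@example.com junk"]

for s in samples:
    found = re.search(EMAIL, s) is not None      # shape appears somewhere
    valid = re.fullmatch(EMAIL, s) is not None   # whole string is the shape
    print(f"{s!r}: search={found}, fullmatch={valid}")

# '$' also matches just before a trailing newline; fullmatch does not.
print(bool(re.search(r"^\d+$", "123\n")))     # True
print(bool(re.fullmatch(r"\d+", "123\n")))    # False
```

Both samples pass the search test, but only the first passes fullmatch, which is the behavior a validator wants.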

Those three functions are the whole everyday API. re.search finds the first match anywhere and returns a match object, or None, which after Part 10 you recognize as a union begging for a guard. re.findall returns every match as a list of strings; when the pattern contains groups it returns the group contents instead, as strings for a single group and as tuples for several. re.sub replaces matches: regex-powered find and replace. Groups are the superpower: parentheses carve a match into numbered pieces (or named ones, with the (?P<name>...) syntax), turning "find the date" into "extract year, month, and day" in one motion.
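How findall changes its return shape with groups is easiest to absorb by running it. A minimal sketch, using a made-up sample string:

```python
import re

text = "Shipped 2026-06-13, delivered 2026-06-15"

print(re.findall(r"\d{4}-\d{2}-\d{2}", text))        # ['2026-06-13', '2026-06-15']
print(re.findall(r"(\d{4})-\d{2}-\d{2}", text))      # ['2026', '2026']
print(re.findall(r"(\d{4})-(\d{2})-(\d{2})", text))  # [('2026', '06', '13'), ('2026', '06', '15')]

# Named groups use (?P<name>...) and are read back by name.
m = re.search(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})", text)
if m:  # guard against None before touching the match
    print(m.group("year"), m.group("month"), m.group("day"))   # 2026 06 13
    print(m.groupdict())   # {'year': '2026', 'month': '06', 'day': '13'}
```

Named groups cost a few extra characters in the pattern and tend to pay for themselves once a pattern has more than two or three groups.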

Checkpoint

Which pattern matches a Sri Lankan style phone number like 077-1234567?

Three refinements take your regex from classroom to workplace. When a pattern is used repeatedly, compile it once with re.compile and reuse the object, which saves a little work on each call and gives the pattern a name that documents intent. Flags adjust matching behavior: re.IGNORECASE does what it says, and re.MULTILINE makes ^ and $ work per line instead of per string, essential when scanning whole files. And know that quantifiers are greedy by default, matching as much as possible, so r"<.+>" swallows from the first angle bracket to the last one on the line; the lazy variant .+? stops at the first close, and that one question mark is the difference between extracting one HTML tag and extracting everything between the first tag and the last.
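All three refinements fit in a few lines. The sketch below assumes a made-up log format and a hypothetical pattern name, ERROR_LINE, purely for illustration:

```python
import re

# Compile once, name it, reuse it. Flags combine with the | operator.
ERROR_LINE = re.compile(r"^error: (.+)$", re.IGNORECASE | re.MULTILINE)

log = "info: started\nERROR: disk full\nError: retrying\ninfo: done"
print(ERROR_LINE.findall(log))          # ['disk full', 'retrying']

html = "<b>bold</b> and <i>italic</i>"
print(re.findall(r"<.+>", html))        # greedy: ['<b>bold</b> and <i>italic</i>']
print(re.findall(r"<.+?>", html))       # lazy:   ['<b>', '</b>', '<i>', '</i>']
```

Without re.MULTILINE the first pattern would find nothing here, because ^ would only match at the very start of the string, where the line begins with "info".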

3. JSON: the language of APIs

JSON, JavaScript Object Notation, is how web services, configuration files, and most modern tools exchange structured data, and Python wears it like a glove because the mapping is nearly one to one: JSON objects become dicts, arrays become lists, and strings, numbers, booleans, and null become str, int or float, True or False, and None. You need exactly four functions from the json module: loads parses a string, dumps serializes to one, and load and dump do the same straight from and to files. Real JSON nests, with dicts holding lists of dicts, and your Part 5 skills walk it naturally.

import json

raw = '{"name": "Amina", "scores": [87, 91], "active": true}'
student = json.loads(raw)
print(student["scores"][1])                  # 91
print(type(student["active"]))               # <class 'bool'>

student["city"] = "Colombo"
print(json.dumps(student, indent=2))         # pretty, share-ready

with open("student.json", "w") as f:         # to a file
    json.dump(student, f, indent=2)
with open("student.json") as f:              # and back
    again = json.load(f)
print(again == student)                      # True: a clean round trip

Two habits separate the professionals here. First, malformed JSON raises json.JSONDecodeError, and any JSON arriving from the outside world deserves the try treatment from Part 7. Second, when a service hands you nested JSON, explore it interactively before coding against it: load it in the REPL, check .keys(), drill one level, repeat. Five minutes of mapping saves an hour of KeyError whack-a-mole, a workflow that becomes second nature by the time you call your first real API in Part 16.
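Both habits can be sketched in one short script. The helper name parse_payload and the sample payloads are hypothetical, invented for this illustration:

```python
import json

def parse_payload(raw):
    """Return the parsed object, or None if raw is not valid JSON."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError as err:
        print(f"bad JSON at line {err.lineno}, column {err.colno}: {err.msg}")
        return None

good = '{"user": {"name": "Amina", "orders": [{"id": 4321, "total": 185000}]}}'
bad = '{"user": {"name": "Amina",}'    # trailing comma: not valid JSON

data = parse_payload(good)
print(parse_payload(bad))                  # None, after the error message

# Explore one level at a time, the same steps you would type in the REPL.
print(list(data.keys()))                   # ['user']
print(list(data["user"].keys()))           # ['name', 'orders']
print(data["user"]["orders"][0]["total"])  # 185000
```

Returning None is one reasonable choice for a sketch; a larger program might prefer to log the error and re-raise so the caller decides what to do.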

4. CSV: the language of spreadsheets

CSV is deceptively simple, rows of comma-separated values, and the deception is the point: fields can contain commas inside quotes, quotes escaped by doubling, line breaks inside fields. Never parse CSV with split(","), because the edge cases will find you. The csv module handles them all, and its dictionary flavor is the one to learn: DictReader uses the header row to give you each record as a dict, and DictWriter does the reverse. Suddenly every row is the labeled data structure you have been fluent in since Part 5.

import csv

with open("sales.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["product", "qty", "price"])
    writer.writeheader()
    writer.writerow({"product": "Laptop, 15 inch", "qty": 2, "price": 185000})
    writer.writerow({"product": "Mouse", "qty": 10, "price": 2500})

with open("sales.csv", newline="") as f:
    for row in csv.DictReader(f):
        value = int(row["qty"]) * float(row["price"])
        print(f"{row['product']:18} -> {value:>10,.0f}")

Note the comma surviving safely inside "Laptop, 15 inch", the quoting handled invisibly in both directions, and one eternal gotcha: CSV gives you strings, always, so numbers need the int and float conversions you have been doing since Part 1, with the try protection of Part 7 when the data is untrusted. The newline="" argument in open is what the csv module's documentation asks for: it lets the module handle line endings itself, so line breaks inside quoted fields survive and no extra carriage returns are written on Windows. Treat it as part of the incantation.
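To see what the csv module is protecting you from, compare a naive split with csv.reader on the kind of line the writer above produces. A minimal sketch:

```python
import csv

# The line DictWriter emits for the laptop row, quotes included.
line = '"Laptop, 15 inch",2,185000'

naive = line.split(",")
proper = next(csv.reader([line]))   # csv.reader accepts any iterable of lines

print(naive)    # ['"Laptop', ' 15 inch"', '2', '185000']
print(proper)   # ['Laptop, 15 inch', '2', '185000']
```

The naive split produces four pieces with the quotes still attached, so every column after the first lands in the wrong position.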

The csv module also handles the format's many dialects. Tab-separated exports need only delimiter="\t" in the reader, European spreadsheets that use semicolons need delimiter=";", and unusual quoting conventions have matching options. When a file arrives looking almost like CSV but not quite, resist the urge to hand-parse: open it in a text editor, identify the delimiter and quoting by eye, and tell DictReader, which has seen every dialect you will ever meet.
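A short sketch of the delimiter option in action. The two sample exports are invented for illustration, and io.StringIO stands in for an opened file:

```python
import csv
import io

# Hypothetical semicolon-separated export, with decimal commas in the price.
semicolon_export = "product;qty;price\nLaptop;2;1850,50\nMouse;10;25,00\n"
tab_export = "product\tqty\nLaptop\t2\n"

parsed = []
for row in csv.DictReader(io.StringIO(semicolon_export), delimiter=";"):
    price = float(row["price"].replace(",", "."))   # decimal comma to decimal point
    parsed.append((row["product"], int(row["qty"]), price))
print(parsed)   # [('Laptop', 2, 1850.5), ('Mouse', 10, 25.0)]

rows = list(csv.DictReader(io.StringIO(tab_export), delimiter="\t"))
print(rows[0]["product"], rows[0]["qty"])   # Laptop 2
```

The decimal-comma conversion is a separate concern from the delimiter: the csv module splits fields, and interpreting their contents stays your job.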

Checkpoint

Why is line.split(",") the wrong way to parse a CSV file?

5. Practice: the full translator

Now the lesson's promise made real: a complete pipeline that takes a messy CSV export, validates rows with regex, converts types, skips and reports the broken ones, and emits clean JSON, every stage built from a lesson in this course. This shape, ingest, validate, transform, emit, is a substantial fraction of all paid Python work, and the CSV Sales Analyzer mini project in the Learn Python app is a bigger sibling of exactly this program.

Python playground

Study how the failure handling works: regex validates shape, int() validates numbers, both failure modes funnel into one except as ValueError, and rejects are collected with reasons rather than discarded, the Part 7 discipline. When this pipeline grows up, the validation layer becomes Pydantic models and the shapes become type-checked end to end; the structured outputs lesson in our advanced series even teaches language models to emit JSON that passes this kind of gate.
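A minimal sketch of a pipeline with this shape follows. It is an illustration rather than the playground's own code: the column names, the sample data, the EMAIL pattern, and the function names clean_row and run_pipeline are all assumptions made for the example.

```python
import csv
import io
import json
import re

# Hypothetical validation pattern and sample data, for illustration only.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

MESSY_CSV = """name,email,qty
Amina Perera,amina@example.com,2
  Nimal Silva ,NIMAL@EXAMPLE.COM,10
Broken Row,not-an-email,3
Kamal,kamal@example.com,three
"""

def clean_row(row):
    """Return a cleaned dict, or raise ValueError with a reason."""
    email = row["email"].strip().lower()
    if not EMAIL.fullmatch(email):              # regex validates shape
        raise ValueError(f"bad email: {email!r}")
    return {
        "name": row["name"].strip(),
        "email": email,
        "qty": int(row["qty"]),                 # int() raises ValueError itself
    }

def run_pipeline(source):
    """Ingest CSV rows, returning (clean records, rejects with reasons)."""
    clean, rejects = [], []
    # Data rows start on line 2 of the file, after the header.
    for line_no, row in enumerate(csv.DictReader(source), start=2):
        try:
            clean.append(clean_row(row))
        except ValueError as err:               # both failure modes land here
            rejects.append({"line": line_no, "reason": str(err)})
    return clean, rejects

clean, rejects = run_pipeline(io.StringIO(MESSY_CSV))
print(json.dumps(clean, indent=2))
for r in rejects:
    print(f"line {r['line']}: {r['reason']}")
```

Two of the four rows come out clean and two are rejected with a reason attached. One limitation of this sketch: a row with missing columns would raise a different exception type, which it does not handle.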

! Common mistakes to avoid

  • Writing regex without raw strings: "\d+" instead of r"\d+".

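    Normal strings process backslash escapes before regex ever sees them: "\b" arrives as a backspace character instead of a word boundary, and unrecognized escapes such as "\d" survive only by accident while triggering a warning in modern Python. The r prefix on every pattern, no exceptions, ends the problem.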

  • Calling .group() on a search that may have returned None.

    re.search returns Match | None, your Part 10 union. Guard it: m = re.search(...), then if m: use m.group(). Checkers flag this; production crashes taught them to.

  • Reaching for regex when a string method reads better.

    if text.startswith("ERROR") beats re.match(r"^ERROR", text) every time. Regex is for shapes, not literals; the simplest tool that works wins reviews.

  • Forgetting that CSV fields are always strings.

    row["qty"] is the string "2" until you convert it. Type conversions at the boundary, with try/except for untrusted data, are part of every honest CSV reader.
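The first two mistakes can be demonstrated in a few lines. The function name extract_order_id is hypothetical, chosen for this sketch:

```python
import re

def extract_order_id(text):
    """Return the order number as an int, or None when no '#digits' appears."""
    m = re.search(r"#(\d+)", text)
    if m is None:          # guard first: search returns None on no match
        return None
    return int(m.group(1))

print(extract_order_id("Order #4321 shipped"))   # 4321
print(extract_order_id("No order here"))         # None

# Raw strings matter: "\b" is one backspace character, r"\b" is two characters
# (a backslash and a 'b') that regex reads as a word boundary.
print(len("\b"), len(r"\b"))                     # 1 2
```

Without the guard, the second call would fail with an AttributeError on None instead of returning a value the caller can check.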

? Frequently asked questions

How do I learn to write bigger regex patterns?

Incrementally, in a tester. Build the pattern one block at a time against real sample text, exactly like the playground exercise. Nobody writes a forty-character pattern in one go; everyone composes and tests.

JSON or CSV for my own program's data?

JSON for nested or typed data and config; CSV when humans will open it in a spreadsheet or it is naturally tabular. For serious volumes or queries, Part 13's project introduces SQLite, the third option that beats both.

What about Excel files, .xlsx?

Third-party territory: openpyxl or pandas read them directly. Often the pragmatic answer is exporting to CSV first; every spreadsheet tool speaks it, and then today's lesson applies unchanged.

Is there a JSON gotcha with Python types?

A few worth knowing: JSON keys are always strings (numeric dict keys become strings on the round trip), tuples become lists, and datetimes need converting before dumping. When a dump fails, the error names the offending type precisely.
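These gotchas are quick to reproduce. A minimal sketch with made-up data; converting the date with isoformat is one common fix, not the only one:

```python
import json
from datetime import date

record = {1: "one", "point": (3, 4)}
round_trip = json.loads(json.dumps(record))
print(round_trip)          # {'1': 'one', 'point': [3, 4]}

event = {"name": "launch", "when": date(2026, 6, 13)}
try:
    json.dumps(event)
except TypeError as err:
    print(err)             # Object of type date is not JSON serializable

# One common fix: convert to an ISO string before dumping.
event["when"] = event["when"].isoformat()
print(json.dumps(event))   # {"name": "launch", "when": "2026-06-13"}
```

Note that the round trip is lossy in both cases: the integer key and the tuple do not come back unless your own code converts them.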

6. Recap and what comes next

You can now translate between the world's three everyday data dialects, free text, JSON, and CSV: string methods for literal text, a working regex core of classes, quantifiers, anchors, and groups driven through search, findall, and sub for shaped text, JSON round trips with the json module, and robust CSV with DictReader and DictWriter, all combined into a validating pipeline with honest error reporting. This is the lesson employers quietly test in interviews, because it is the lesson daily work is made of.

Next, the course tackles the topic with the most mystique and strips it down to plain decisions: Part 12, concurrency, where your programs learn to wait for many things at once. Before then, finish the three playground exercises above, and run the JSON and CSV quizzes in the Learn Python app below; the full sixteen-part syllabus is on the series hub.

💡

Pro tip

Keep a personal patterns.py file of regexes you have tested and trusted, with a sample match in a comment beside each. Six months from now, your own curated dozen will outperform any cheat sheet on the internet, because every entry will be one you actually understood once.

Practice on the go

Learn Python, the free Android app

Every topic in this series lives in the app too: bite-size lessons, runnable examples, quizzes, mini projects, and an offline Python playground that runs on your phone.
