Regular Expressions Tutorial: A Beginner-Friendly Guide to Regex

What Are Regular Expressions and Why Learn Them?

Regular expressions (commonly abbreviated as regex or regexp) are powerful pattern-matching tools that allow you to search, validate, extract, and manipulate text using specialized syntax. Think of them as a sophisticated search language that goes far beyond simple "find and replace" operations.

Imagine you need to extract all email addresses from a file containing thousands of lines of log data, or validate that a user's phone number follows the correct format. Using traditional string manipulation methods would result in verbose, hard-to-maintain code. Regular expressions can accomplish these tasks with a single, concise pattern.

At their core, regular expressions define search patterns using a combination of literal characters and special metacharacters. These patterns can match simple strings like "cat" or complex structures like email addresses, URLs, or credit card numbers.

Why Should You Learn Regular Expressions?

Real-World Applications

Regular expressions are used extensively across software development and data processing, for tasks such as validating input, searching logs, and extracting data from text.

Pro tip: While regex is powerful, it's not always the best tool for every job. For parsing complex structured data like HTML or JSON, use dedicated parsers instead. Regex works best for pattern matching in plain text.

Basic Syntax and Fundamentals

Regular expressions consist of two types of characters: literal characters (which match themselves) and metacharacters (which have special meanings). Let's start with the fundamentals.

Literal Characters

The simplest regex is just plain text. The pattern cat will match the exact string "cat" in your text.

Text: "The cat sat on the mat"
Regex: cat
Matches: "cat" in "The cat sat on the mat"

Literal characters are case-sensitive by default, so cat won't match "Cat" or "CAT" unless you use a case-insensitive flag.
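
As a rough illustration of the case-sensitivity rule, the sketch below runs the same literal pattern with and without Python's re.IGNORECASE flag. The sample sentence is made up for this example.

```python
import re

text = "The cat sat on the mat. Cat naps are the best."

# Default matching is case-sensitive: only the lowercase "cat" is found.
case_sensitive = re.findall(r"cat", text)

# re.IGNORECASE lets the same pattern match "cat" and "Cat".
case_insensitive = re.findall(r"cat", text, flags=re.IGNORECASE)

print(case_sensitive)    # ['cat']
print(case_insensitive)  # ['cat', 'Cat']
```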

The Dot Metacharacter (.)

The dot . is a wildcard that matches any single character except newline characters.

Text: "cat", "cot", "cut", "c@t"
Regex: c.t
Matches: All four strings

To match a literal dot character, escape it with a backslash: \.

Text: "file.txt"
Regex: file\.txt
Matches: "file.txt" (not "fileAtxt")
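
The difference between a wildcard dot and an escaped dot can be checked with a short Python sketch. The word lists are illustrative, and re.fullmatch is used so that the whole string has to fit the pattern.

```python
import re

words = ["cat", "cot", "cut", "c@t", "ct", "coat"]

# "." stands for exactly one character, so "ct" and "coat" are rejected.
dot_matches = [w for w in words if re.fullmatch(r"c.t", w)]

names = ["file.txt", "fileAtxt"]

# An unescaped dot accepts any character in that position.
loose = [n for n in names if re.fullmatch(r"file.txt", n)]

# An escaped dot accepts only a literal ".".
strict = [n for n in names if re.fullmatch(r"file\.txt", n)]

print(dot_matches)  # ['cat', 'cot', 'cut', 'c@t']
print(loose)        # ['file.txt', 'fileAtxt']
print(strict)       # ['file.txt']
```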

Anchors: Matching Positions

Anchors don't match characters—they match positions in the text.

Caret (^) - Start of Line: The ^ anchor matches the beginning of the string, or the beginning of every line when multiline mode is enabled.

Text: "cat\ndog\ncat"
Regex: ^cat
Matches: Only the first "cat"

Dollar Sign ($) - End of Line: The $ anchor matches the end of the string, or the end of every line when multiline mode is enabled.

Text: "cat\ndog\ncat"
Regex: cat$
Matches: Only the last "cat"

Combining Anchors: Use both to match entire lines (enable multiline mode when the text spans several lines).

Regex: ^cat$
Matches: Only lines containing exactly "cat" with nothing before or after
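
How the anchors behave depends on whether multiline mode is on. The sketch below, using Python's re.MULTILINE flag, records where each pattern matches in the sample text; the second sample string is an assumption added for illustration.

```python
import re

text = "cat\ndog\ncat"

# Default mode: ^ and $ refer to the start and end of the whole string.
start_default = [m.start() for m in re.finditer(r"^cat", text)]
end_default = [m.start() for m in re.finditer(r"cat$", text)]

# Multiline mode: ^ and $ also match at the start and end of each line.
start_multiline = [m.start() for m in re.finditer(r"^cat", text, re.MULTILINE)]
end_multiline = [m.start() for m in re.finditer(r"cat$", text, re.MULTILINE)]

# Both anchors together select lines that are exactly "cat".
whole_lines = re.findall(r"^cat$", "cat\ncatalog\ncat", re.MULTILINE)

print(start_default, end_default)      # [0] [8]
print(start_multiline, end_multiline)  # [0, 8] [0, 8]
print(whole_lines)                     # ['cat', 'cat']
```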

Word Boundaries (\b)

The \b anchor matches word boundaries—positions between word and non-word characters.

Text: "cat category caterpillar"
Regex: \bcat\b
Matches: Only the standalone word "cat"

This is incredibly useful for finding whole words without matching partial words.
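
A quick way to see the effect of \b is to count matches with and without it. This is a minimal sketch; the extra word "concat" is added to the sample text to show a match at the end of a longer word.

```python
import re

text = "cat category caterpillar concat"

# Without boundaries, "cat" is found inside every word that contains it.
anywhere = re.findall(r"cat", text)

# With \b on both sides, only the standalone word is matched.
whole_word_spans = [m.span() for m in re.finditer(r"\bcat\b", text)]

print(len(anywhere))      # 4
print(whole_word_spans)   # [(0, 3)]
```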

Escape Sequences

Special characters in regex need to be escaped with a backslash to match them literally:

Special characters: . * + ? ^ $ { } [ ] ( ) | \
Escaped forms:      \. \* \+ \? \^ \$ \{ \} \[ \] \( \) \| \\

Example matching a price:

Regex: \$\d+\.\d{2}
Matches: "$19.99", "$5.00"
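
The price pattern can be tried out in Python as sketched below. The sample sentence is invented, and re.escape is shown as one way to build a literal pattern from arbitrary text without escaping each character by hand.

```python
import re

text = "Coffee $5.00, lunch $19.99, tip 3.50, fee $7"

# "$" and "." are escaped so they match literally.
price_pattern = re.compile(r"\$\d+\.\d{2}")
prices = price_pattern.findall(text)

# re.escape adds the backslashes needed to match a string literally.
escaped = re.escape("$19.99")
found_literal = re.search(escaped, text) is not None

print(prices)         # ['$5.00', '$19.99']
print(found_literal)  # True
```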

Character Classes and Ranges

Character classes let you define a set of characters and match any one of them. They're enclosed in square brackets.

Basic Character Classes

Square brackets [] create a character set that matches any single character inside.

Text: "cat", "cot", "cut", "cit"
Regex: c[aou]t
Matches: "cat", "cot", "cut" (not "cit")

Character Ranges

Use hyphens to define ranges of characters:

Text: "a1", "b2", "c3", "d4"
Regex: [a-c][1-3]
Matches: "a1", "b2", "c3" (not "d4")

Negated Character Classes

Use a caret ^ at the start of a character class to negate it—matching any character NOT in the set.

Regex: [^0-9]
Matches: Any character that is NOT a digit

Text: "abc123def"
Regex: [^a-z]+
Matches: "123" (the run of characters that are not lowercase letters)

Predefined Character Classes

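Regex provides shorthand for common character classes. The equivalents below are the ASCII definitions; some engines, such as Python 3 in its default Unicode mode, also match non-ASCII digits, letters, and whitespace: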

Shorthand  Equivalent        Description
\d         [0-9]             Any digit
\D         [^0-9]            Any non-digit
\w         [a-zA-Z0-9_]      Any word character
\W         [^a-zA-Z0-9_]     Any non-word character
\s         [ \t\n\r\f\v]     Any whitespace character
\S         [^ \t\n\r\f\v]    Any non-whitespace character

Example matching a simple phone number:

Regex: \d{3}-\d{3}-\d{4}
Matches: "555-123-4567"

Quick tip: Uppercase versions of shorthand classes are always the negation of their lowercase counterparts. \d matches digits, \D matches non-digits.
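
The character-class examples in this section can be checked with a few lines of Python. This is only a sketch; the string "user_name-42!" is an assumed sample, not part of the examples above.

```python
import re

# Sets and ranges: one character from each bracketed class.
codes = ["a1", "b2", "c3", "d4"]
in_range = [c for c in codes if re.fullmatch(r"[a-c][1-3]", c)]

# Negated class: a run of characters that are not lowercase letters.
non_lowercase = re.findall(r"[^a-z]+", "abc123def")

# Shorthand classes and their uppercase negations.
word_runs = re.findall(r"\w+", "user_name-42!")
non_word_chars = re.findall(r"\W", "user_name-42!")

# The simple phone pattern built from \d and counted repetition.
is_phone = re.fullmatch(r"\d{3}-\d{3}-\d{4}", "555-123-4567") is not None

print(in_range)        # ['a1', 'b2', 'c3']
print(non_lowercase)   # ['123']
print(word_runs)       # ['user_name', '42']
print(non_word_chars)  # ['-', '!']
print(is_phone)        # True
```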

Quantifiers: Controlling Match Repetition

Quantifiers specify how many times a character or group should be matched. They're placed after the element you want to repeat.

Basic Quantifiers

Examples of Quantifiers in Action

Asterisk (*) - Zero or More:

Regex: ca*t
Matches: "ct", "cat", "caat", "caaat"

Plus (+) - One or More:

Regex: ca+t
Matches: "cat", "caat", "caaat" (not "ct")

Question Mark (?) - Optional:

Regex: colou?r
Matches: "color" and "colour"

Exact Count {n}:

Regex: \d{3}
Matches: Exactly three digits like "123"

Range {n,m}:

Regex: \d{2,4}
Matches: 2 to 4 digits like "12", "123", or "1234"
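
The quantifier examples can be verified side by side in Python. In this sketch the candidate strings are illustrative, and \b is added around the digit patterns so that a longer number is not matched in pieces.

```python
import re

candidates = ["ct", "cat", "caat", "caaat"]

star = [s for s in candidates if re.fullmatch(r"ca*t", s)]  # zero or more "a"
plus = [s for s in candidates if re.fullmatch(r"ca+t", s)]  # one or more "a"

spellings = ["color", "colour", "colouur"]
optional = [s for s in spellings if re.fullmatch(r"colou?r", s)]  # "u" is optional

numbers = "7 42 123 9876 54321"
exactly_three = re.findall(r"\b\d{3}\b", numbers)
two_to_four = re.findall(r"\b\d{2,4}\b", numbers)

print(star)           # ['ct', 'cat', 'caat', 'caaat']
print(plus)           # ['cat', 'caat', 'caaat']
print(optional)       # ['color', 'colour']
print(exactly_three)  # ['123']
print(two_to_four)    # ['42', '123', '9876']
```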

Greedy vs. Lazy Quantifiers

By default, quantifiers are greedy—they match as much text as possible. Adding ? after a quantifier makes it lazy (matching as little as possible).

Text: "<div>content</div><div>more</div>"
Regex (greedy): <div>.*</div>
Matches: "<div>content</div><div>more</div>" (entire string)

Regex (lazy): <div>.*?</div>
Matches: "<div>content</div>" (stops at the first closing tag)

Lazy quantifiers: *?, +?, ??, {n,m}?

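Pro tip: A greedy .* first consumes as much as it can and then backtracks, which can be slow on large texts and often overshoots the intended match. Use a lazy quantifier when you need the shortest possible match, or a negated character class such as [^<]* when the stopping character is known. Neither approach handles arbitrarily nested structures reliably.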
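
The greedy and lazy patterns from this section can be compared directly. A minimal Python sketch, using the same sample string:

```python
import re

html = "<div>content</div><div>more</div>"

# Greedy: .* runs to the last "</div>", so there is a single long match.
greedy = re.findall(r"<div>.*</div>", html)

# Lazy: .*? stops at the nearest "</div>", giving one match per element.
lazy = re.findall(r"<div>.*?</div>", html)

# re.search returns only the first match.
first_lazy = re.search(r"<div>.*?</div>", html).group()

print(greedy)      # ['<div>content</div><div>more</div>']
print(lazy)        # ['<div>content</div>', '<div>more</div>']
print(first_lazy)  # <div>content</div>
```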

Practical Example: Matching HTML Tags

Regex: <([a-z]+)>.*?</\1>
Matches: Paired HTML tags like "<p>text</p>" or "<div>content</div>"

This pattern uses lazy matching to avoid capturing multiple tags at once, and backreferences (covered next) to ensure opening and closing tags match.

Groups and Capturing

Parentheses () create groups that serve multiple purposes: they group parts of a pattern together, capture matched text for later use, and enable backreferences.

Basic Grouping

Groups let you apply quantifiers to multiple characters:

Regex: (ha)+
Matches: "ha", "haha", "hahaha"

Without grouping, ha+ would match "ha", "haa", "haaa" (only the 'a' repeats).

Capturing Groups

Groups automatically capture the text they match, which you can reference later:

Text: "John Smith"
Regex: (\w+) (\w+)
Captures: Group 1 = "John", Group 2 = "Smith"

In most programming languages, you can access these captures:

// JavaScript example
const match = "John Smith".match(/(\w+) (\w+)/);
console.log(match[1]); // "John"
console.log(match[2]); // "Smith"
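
For comparison, here is a sketch of grouping and capturing with Python's re module. It mirrors the examples above; the list of "laughs" is an assumed sample.

```python
import re

laughs = ["ha", "haha", "hahaha", "haa"]

# (ha)+ repeats the whole group; ha+ repeats only the "a".
grouped = [s for s in laughs if re.fullmatch(r"(ha)+", s)]
ungrouped = [s for s in laughs if re.fullmatch(r"ha+", s)]

# Captured text is read with group(n); group(0) is the whole match.
match = re.search(r"(\w+) (\w+)", "John Smith")
first_name, last_name = match.group(1), match.group(2)
whole = match.group(0)

print(grouped)                # ['ha', 'haha', 'hahaha']
print(ungrouped)              # ['ha', 'haa']
print(first_name, last_name)  # John Smith
```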

Backreferences

Backreferences let you match the same text that was captured by a group earlier in the pattern. Use \1, \2, etc.

Regex: (\w+) \1
Matches: Repeated words like "the the" or "is is"

Regex: <([a-z]+)>.*?</\1>
Matches: Matching HTML tags like "<div>...</div>"
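
The sketch below tries both backreference patterns in Python. It wraps the repeated-word pattern in \b, which is an addition to the pattern shown above: without it, the "is" at the end of "This" followed by " is" would also count as a repeat. The sample sentence is made up.

```python
import re

sentence = "This is is a test of the the parser"

# \1 must repeat exactly what group 1 captured; findall returns group 1.
repeated = re.findall(r"\b(\w+) \1\b", sentence)

# Without \b the match starts inside "This" ("is is" from "This is").
unanchored_start = re.search(r"(\w+) \1", sentence).start()

# The closing tag has to use the same name as the opening tag.
tag_pattern = re.compile(r"<([a-z]+)>.*?</\1>")
paired = tag_pattern.search("<p>text</p>") is not None
mismatched = tag_pattern.search("<p>text</div>") is not None

print(repeated)          # ['is', 'the']
print(unanchored_start)  # 2
print(paired, mismatched)  # True False
```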

Non-Capturing Groups

Sometimes you need grouping without capturing. Use (?:...) for non-capturing groups:

Regex: (?:https?|ftp)://\S+
Matches: URLs starting with http, https, or ftp
(The protocol isn't captured as a group)

Non-capturing groups keep group numbering tidy and avoid storing text you never reference; any speed gain is usually small.
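
One practical difference shows up in Python's re.findall, which returns the captured groups instead of the whole match when a pattern contains a capturing group. A small sketch with made-up URLs:

```python
import re

text = "See https://example.com and ftp://files.example.org/pub now"

# Non-capturing group: findall returns the full matches.
full_urls = re.findall(r"(?:https?|ftp)://\S+", text)

# Capturing group: findall returns only what the group captured.
protocols = re.findall(r"(https?|ftp)://\S+", text)

print(full_urls)  # ['https://example.com', 'ftp://files.example.org/pub']
print(protocols)  # ['https', 'ftp']
```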

Named Capturing Groups

Named groups make your regex more readable and maintainable:

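Regex: (?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})
Matches: Dates like "2026-03-31"
Access (JavaScript): match.groups.year, match.groups.month, match.groups.day
Note: Python's re module writes the group as (?P<year>...) and reads it with match.group("year").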

Named groups are especially useful in complex patterns where numbered references become confusing.

Quick tip: Use named groups for complex patterns that you or others will need to maintain. The slight verbosity pays off in readability.
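
In Python the same idea looks like the sketch below. Note that Python's re module uses the (?P<name>...) spelling for named groups; the sample sentence is an assumption for illustration.

```python
import re

# Python syntax for named groups: (?P<name>...)
date_pattern = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})")

match = date_pattern.search("Released on 2026-03-31.")

year = match.group("year")  # access one group by name
parts = match.groupdict()   # all named groups as a dict

print(year)   # 2026
print(parts)  # {'year': '2026', 'month': '03', 'day': '31'}
```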

Alternation and Choice Operators

The pipe symbol | acts as an OR operator, allowing you to match one pattern or another.

Basic Alternation

Regex: cat|dog
Matches: Either "cat" or "dog"

Regex: gray|grey
Matches: Both American and British spellings

Alternation with Groups

Combine alternation with groups for more complex patterns:

Regex: (Mr|Ms|Mrs|Dr)\. \w+
Matches: "Mr. Smith", "Dr. Jones", "Ms. Williams"

Regex: \.(jpg|jpeg|png|gif)$
Matches: Image file extensions at the end of filenames

Order Matters

The regex engine tries alternatives from left to right and stops at the first match:

Text: "category"
Regex: cat|category
Matches: "cat" (the first alternative succeeds, so "category" is never tried)

Better: category|cat
Matches: "category" (longer match first)

When one alternative is a prefix of another, put the longer or more specific one first to avoid premature matching.
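
The ordering effect is easy to reproduce. The sketch below uses Python's re.search, which behaves as described for backtracking engines, and adds re.fullmatch to show that the engine does fall back to a later alternative when the rest of the pattern requires it.

```python
import re

# The first alternative that succeeds wins, even if a longer one exists.
short_first = re.search(r"cat|category", "category").group()
long_first = re.search(r"category|cat", "category").group()

# If the whole string must match, the engine backtracks into "category".
forced_full = re.fullmatch(r"cat|category", "category").group()

print(short_first)  # cat
print(long_first)   # category
print(forced_full)  # category
```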

Practical Examples

Matching multiple date formats:

Regex: \d{4}-\d{2}-\d{2}|\d{2}/\d{2}/\d{4}
Matches: "2026-03-31" or "03/31/2026"

Matching different phone formats:

Regex: \d{3}-\d{3}-\d{4}|\(\d{3}\) \d{3}-\d{4}
Matches: "555-123-4567" or "(555) 123-4567"
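
As a sketch of alternation in practice, the phone pattern above can be run over a line of text. The contact line is invented, and the fax number uses dots so that it deliberately fits neither alternative.

```python
import re

phone_pattern = re.compile(r"\d{3}-\d{3}-\d{4}|\(\d{3}\) \d{3}-\d{4}")

contacts = "Office: 555-123-4567, mobile: (555) 987-6543, fax: 555.123.0000"

# Each alternative handles one format; the dotted fax number matches neither.
phones = phone_pattern.findall(contacts)

print(phones)  # ['555-123-4567', '(555) 987-6543']
```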

Lookahead and Lookbehind Assertions

Lookaround assertions check if a pattern exists before or after the current position without including it in the match. They're zero-width assertions—they don't consume characters.

Positive Lookahead (?=...)

Matches a position where the pattern inside the lookahead follows:

Regex: \d+(?= dollars)
Text: "50 dollars and 30 euros"
Matches: "50" (only the number before "dollars")

The lookahead checks for " dollars" but doesn't include it in the match.

Negative Lookahead (?!...)

Matches a position where the pattern inside does NOT follow:

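Regex: \b\d+\b(?! dollars)
Text: "50 dollars and 30 euros"
Matches: "30" (the number NOT followed by " dollars")

The word boundaries matter here. Without them, \d+(?! dollars) would backtrack and also match the "5" in "50", because that "5" is followed by "0" rather than " dollars".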

Positive Lookbehind (?<=...)

Matches a position where the pattern inside precedes:

Regex: (?<=\$)\d+
Text: "Price: $50 and €30"
Matches: "50" (only the number after "$")

Negative Lookbehind (?<!...)

Matches a position where the pattern inside does NOT precede:

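Regex: (?<![$\d])\d+
Text: "Price: $50 and €30"
Matches: "30" (the number NOT preceded by "$")

The \d inside the lookbehind stops a match from starting in the middle of a number. The simpler (?<!\$)\d+ would also match the "0" in "$50", because that "0" is preceded by "5", not "$".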
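
All four lookarounds can be exercised in one Python sketch. It also runs the simplest negative forms to show a backtracking pitfall: when a negative lookaround is attached to \d+, the engine may settle for part of a number, so the stricter variants add \b or \d to force whole-number matches.

```python
import re

amounts = "50 dollars and 30 euros"
prices = "Price: $50 and €30"

# Positive lookahead: digits followed by " dollars".
before_dollars = re.findall(r"\d+(?= dollars)", amounts)

# Negative lookahead: the naive form matches part of "50".
naive_ahead = re.findall(r"\d+(?! dollars)", amounts)
strict_ahead = re.findall(r"\b\d+\b(?! dollars)", amounts)

# Positive lookbehind: digits preceded by "$".
after_dollar_sign = re.findall(r"(?<=\$)\d+", prices)

# Negative lookbehind: the naive form matches the "0" of "$50".
naive_behind = re.findall(r"(?<!\$)\d+", prices)
strict_behind = re.findall(r"(?<![$\d])\d+", prices)

print(before_dollars)              # ['50']
print(naive_ahead, strict_ahead)   # ['5', '30'] ['30']
print(after_dollar_sign)           # ['50']
print(naive_behind, strict_behind) # ['0', '30'] ['30']
```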

Practical Applications

Password Validation: Ensure a password is at least eight characters long and contains at least one uppercase letter, one lowercase letter, and one digit:

Regex: ^(?=.*[a-z])(?=.*[A-Z])(?=.*\d).{8,}$
Validates: Passwords with all requirements met

Extract Domain from Email:

Regex: (?<=@)[a-zA-Z0-9.-]+
Text: "[email protected]"
Matches: "example.com"

Find Numbers Not in Parentheses:

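Regex: (?<![(\d])\d+(?![)\d])
Text: "Call (555) 123-4567"
Matches: "123" and "4567" (not "555")

Including \d in both lookarounds forces whole-number matches. The simpler (?<!\()\d+(?!\)) would still match the middle "5" of "(555)".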

Pro tip: Lookaround assertions are powerful but can impact performance. Use them judiciously, especially in patterns that will process large amounts of text. Some regex engines have limited lookbehind support.
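
Two of the practical patterns can be wrapped up as follows. This is a sketch only: the helper name is_valid_password and the sample passwords are hypothetical, and the parentheses pattern includes \d in both lookarounds so that the engine cannot match a fragment of "555".

```python
import re

PASSWORD_PATTERN = re.compile(r"^(?=.*[a-z])(?=.*[A-Z])(?=.*\d).{8,}$")

def is_valid_password(candidate: str) -> bool:
    """Return True if all three lookaheads and the length check pass."""
    return PASSWORD_PATTERN.search(candidate) is not None

# Each lookahead scans from the start without consuming any characters.
results = {
    "Passw0rdX": is_valid_password("Passw0rdX"),        # all rules met
    "password1": is_valid_password("password1"),        # no uppercase letter
    "Sh0rt": is_valid_password("Sh0rt"),                # too short
    "NoDigitsHere": is_valid_password("NoDigitsHere"),  # no digit
}

# Numbers that are neither opened by "(" nor closed by ")".
outside_parens = re.findall(r"(?<![(\d])\d+(?![)\d])", "Call (555) 123-4567")

print(results)
print(outside_parens)  # ['123', '4567']
```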

Common Regex Patterns Reference Table

Here's a comprehensive reference of frequently used regex patterns for common validation and extraction tasks. You can test these patterns using our Regex Tester Tool.

Pattern Type        Regex                                           Description
Email Address       [a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}  Basic email validation
URL                 https?://[^\s]+                                 Simple URL matching
Phone (US)          \(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}             Various US phone formats
ZIP Code (US)       \d{5}(-\d{4})?                                  5-digit or ZIP+4
IP Address (IPv4)   \b(?:\d{1,3}\.){3}\d{1,3}\b                     Basic IPv4 format (octets are not range-checked)
Date (YYYY-MM-DD)   \d{4}-\d{2}-\d{2}                               ISO date format
Time (24-hour)      (?:[01]\d|2[0-3]):[0-5]\d                       HH:MM format, 00:00 to 23:59
Hex Color           #[0-9A-Fa-f]{6}\b                               6-digit hex colors
Credit Card         \d{4}[-\s]?\d{4}[-\s]?\d{4}[-\s]?\d{4}          16-digit card numbers
Username            ^[a-zA-Z0-9_]{3,16}$                            3-16 letters, digits, or underscores
Strong Password     ^(?=.*[a-z])(?=.*[A-Z])(?=.*\d)(?=.*[@$!%*?&])[A-Za-z\d@$!%*?&]{8,}$  Min 8 chars with requirements
HTML Tag            <([a-z]+)([^<>]*)(?:>(.*?)<\/\1>|\s+\/>)        Paired or self-closing tags
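
One way to use such a table in code is to keep the patterns in a dictionary and validate with re.fullmatch. The sketch below is an illustration under assumptions: the dictionary keys, helper names, and sample values are hypothetical. It also shows why the basic IPv4 pattern needs a numeric check on top, since the regex alone only verifies the shape.

```python
import re

# A few patterns copied from the reference table.
PATTERNS = {
    "email": r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}",
    "zip": r"\d{5}(-\d{4})?",
    "ipv4": r"\b(?:\d{1,3}\.){3}\d{1,3}\b",
    "hex_color": r"#[0-9A-Fa-f]{6}\b",
}

def is_valid(kind: str, value: str) -> bool:
    """Return True when the whole value matches the named pattern."""
    return re.fullmatch(PATTERNS[kind], value) is not None

def is_valid_ipv4(value: str) -> bool:
    """Shape check with the regex, then a range check on each octet."""
    if not is_valid("ipv4", value):
        return False
    return all(0 <= int(octet) <= 255 for octet in value.split("."))

print(is_valid("email", "jane.doe@example.com"))  # True
print(is_valid("zip", "12345-6789"))              # True
print(is_valid("ipv4", "999.1.1.1"))              # True (shape only)
print(is_valid_ipv4("999.1.1.1"))                 # False
```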

Advanced Pattern Examples

Use Case             Regex                     Example Match
Extract hashtags     #\w+                      #regex #tutorial
Extract mentions     @\w+                      @username
File extension       \.\w+$                    .txt, .pdf, .jpg
Remove extra spaces  \s+                       Runs of whitespace (replace with one space)
Markdown links       \[([^\]]+)\]\(([^)]+)\)   [text](url)
CSS class names      class="([^"]*)"           class="container"
Extract numbers      -?\d+(?:\.\d+)?           123, -45.67
Trim whitespace      ^\s+|\s+$                 Leading/trailing spaces

You can also use our String Formatter Tool to clean and format text data after extraction.
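
A few of these extraction patterns in action, as a minimal Python sketch. The sample post, Markdown text, and URLs are made up for illustration.

```python
import re

post = "Learning #regex from this #tutorial with @username"
hashtags = re.findall(r"#\w+", post)
mentions = re.findall(r"@\w+", post)

markdown = "See [the docs](https://example.com/docs) and [home](https://example.com)."
# Two capturing groups, so findall returns (text, url) tuples.
links = re.findall(r"\[([^\]]+)\]\(([^)]+)\)", markdown)

# Collapse every run of whitespace to one space, then trim the ends.
cleaned = re.sub(r"\s+", " ", "  too   many    spaces  ").strip()

print(hashtags)  # ['#regex', '#tutorial']
print(mentions)  # ['@username']
print(links)     # [('the docs', 'https://example.com/docs'), ('home', 'https://example.com')]
print(cleaned)   # too many spaces
```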

Using Regex Across Different Languages

While regex syntax is largely consistent across languages, implementation details vary. Here's how to use regex in popular programming languages.

JavaScript

// Literal notation
const regex = /\d{3}-\d{3}-\d{4}/;

// Constructor notation
const regex2 = new RegExp('\\d{3}-\\d{3}-\\d{4}');

// Test if pattern matches
regex.test('555-123-4567'); // true

// Extract matches
const match = '555-123-4567'.match(regex);

// Replace
const result = 'Call 555-123-4567'.replace(regex, 'XXX-XXX-XXXX');

// Global flag for multiple matches
const globalRegex = /\d+/g;
'a1b2c3'.match(globalRegex); // ['1', '2', '3']

Python

import re

# Compile pattern
pattern = re.compile(r'\d{3}-\d{3}-\d{4}')

# Test if pattern matches
pattern.search('555-123-4567')  # Match object or None

# Find all matches
re.findall(r'\d+', 'a1b2c3')  # ['1', '2', '3']

# Replace
re.sub(r'\d{3}-\d{3}-\d{4}', 'XXX-XXX-XXXX', 'Call 555-123-4567')

# Split by pattern
re.split(r'\s+', 'split  by   spaces')  # ['split