Regular Expressions Tutorial: A Beginner-Friendly Guide to Regex
Table of Contents
- What Are Regular Expressions and Why Learn Them?
- Basic Syntax and Fundamentals
- Character Classes and Ranges
- Quantifiers: Controlling Match Repetition
- Groups and Capturing
- Alternation and Choice Operators
- Lookahead and Lookbehind Assertions
- Common Regex Patterns Reference Table
- Using Regex Across Different Languages
- Performance Optimization Tips
- Testing and Debugging Regular Expressions
- Frequently Asked Questions
What Are Regular Expressions and Why Learn Them?
Regular expressions (commonly abbreviated as regex or regexp) are powerful pattern-matching tools that allow you to search, validate, extract, and manipulate text using specialized syntax. Think of them as a sophisticated search language that goes far beyond simple "find and replace" operations.
Imagine you need to extract all email addresses from a file containing thousands of lines of log data, or validate that a user's phone number follows the correct format. Using traditional string manipulation methods would result in verbose, hard-to-maintain code. Regular expressions can accomplish these tasks with a single, concise pattern.
At their core, regular expressions define search patterns using a combination of literal characters and special metacharacters. These patterns can match simple strings like "cat" or complex structures like email addresses, URLs, or credit card numbers.
Why Should You Learn Regular Expressions?
- Text Processing Efficiency: Regex can quickly process large volumes of text data and perform complex search and replace operations that would take dozens of lines of conventional code
- Data Validation: Validating user input (emails, phone numbers, password strength) is a common requirement in web development, and regex provides elegant solutions
- Data Extraction: Extract structured information from unstructured text, such as pulling links from web pages or error messages from logs
- Cross-Platform Universality: Nearly every programming language and text editor supports regular expressions with similar syntax
- Productivity Boost: Mastering regex can dramatically reduce the time spent writing repetitive code and performing manual text operations
- Code Refactoring: Quickly find and modify patterns across entire codebases during refactoring projects
Real-World Applications
Regular expressions are used extensively across software development and data processing:
- Web Form Validation: Ensuring emails, phone numbers, postal codes, and other user inputs match expected formats
- Log Analysis: Parsing server logs to extract error messages, IP addresses, timestamps, and other relevant data
- Text Editor Operations: Advanced search and replace in IDEs like VS Code, Sublime Text, or Vim
- Web Scraping: Extracting specific data patterns from HTML content when building web crawlers
- Configuration File Parsing: Reading and validating configuration files with specific syntax requirements
- Data Cleaning: Standardizing inconsistent data formats in datasets before analysis
- Security: Detecting malicious patterns in user input to prevent injection attacks
Pro tip: While regex is powerful, it's not always the best tool for every job. For parsing complex structured data like HTML or JSON, use dedicated parsers instead. Regex works best for pattern matching in plain text.
Basic Syntax and Fundamentals
Regular expressions consist of two types of characters: literal characters (which match themselves) and metacharacters (which have special meanings). Let's start with the fundamentals.
Literal Characters
The simplest regex is just plain text. The pattern cat will match the exact string "cat" in your text.
Text: "The cat sat on the mat"
Regex: cat
Matches: "cat" (the second word of "The cat sat on the mat")
Literal characters are case-sensitive by default, so cat won't match "Cat" or "CAT" unless you use a case-insensitive flag.
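The case-sensitivity rule can be checked with a short sketch using Python's built-in re module. The sample sentence is invented for illustration, and re.IGNORECASE is Python's spelling of the case-insensitive flag; other languages name it differently.

```python
import re

text = "The cat sat on the mat. Cat naps are common."

# Default matching is case-sensitive: only the lowercase "cat" is found.
case_sensitive = re.findall(r"cat", text)

# re.IGNORECASE lets the same pattern match "Cat" as well.
case_insensitive = re.findall(r"cat", text, flags=re.IGNORECASE)

print(case_sensitive)    # ['cat']
print(case_insensitive)  # ['cat', 'Cat']
```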
The Dot Metacharacter (.)
The dot . is a wildcard that matches any single character except newline characters.
Text: "cat", "cot", "cut", "c@t"
Regex: c.t
Matches: All four strings
To match a literal dot character, escape it with a backslash: \.
Text: "file.txt"
Regex: file\.txt
Matches: "file.txt" (not "fileAtxt")
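Here is a minimal Python sketch of the difference between the wildcard dot and an escaped dot. The candidate strings are made up, and re.fullmatch is used so that each string must match the pattern from start to end.

```python
import re

candidates = ["cat", "cot", "cut", "c@t", "ct", "c\nt"]

# The dot needs exactly one character, and by default that character
# cannot be a newline, so "ct" and "c\nt" are rejected.
dot_matches = [s for s in candidates if re.fullmatch(r"c.t", s)]

filenames = ["file.txt", "fileAtxt"]
# An unescaped dot accepts any character in that position.
loose = [s for s in filenames if re.fullmatch(r"file.txt", s)]
# An escaped dot accepts only a literal ".".
strict = [s for s in filenames if re.fullmatch(r"file\.txt", s)]

print(dot_matches)  # ['cat', 'cot', 'cut', 'c@t']
print(loose)        # ['file.txt', 'fileAtxt']
print(strict)       # ['file.txt']
```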
Anchors: Matching Positions
Anchors don't match characters—they match positions in the text.
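Caret (^) - Start of Line: The ^ anchor matches the beginning of the string. With the multiline flag enabled, it also matches at the start of every line.
Text: "cat\ndog\ncat"
Regex: ^cat
Matches: Only the first "cat" (with the multiline flag, the last "cat" also matches because it starts a line)
Dollar Sign ($) - End of Line: The $ anchor matches the end of the string. With the multiline flag enabled, it also matches at the end of every line.
Text: "cat\ndog\ncat"
Regex: cat$
Matches: Only the last "cat" (with the multiline flag, the first "cat" also matches because it ends a line)
Combining Anchors: Use both to match an entire string, or entire lines in multiline mode.
Regex: ^cat$
Matches: Only strings (or, in multiline mode, lines) containing exactly "cat" with nothing before or after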
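Whether ^ and $ apply per string or per line depends on the multiline flag, which is easy to confirm with a small experiment. The sketch below uses Python's re.MULTILINE flag and records the start offset of each match; the second test string is invented for illustration.

```python
import re

text = "cat\ndog\ncat"

# Without re.MULTILINE, ^ and $ anchor to the whole string.
start_default = [m.start() for m in re.finditer(r"^cat", text)]
end_default = [m.start() for m in re.finditer(r"cat$", text)]

# With re.MULTILINE, they also anchor to each line.
start_multiline = [m.start() for m in re.finditer(r"^cat", text, re.MULTILINE)]
end_multiline = [m.start() for m in re.finditer(r"cat$", text, re.MULTILINE)]

# ^cat$ in multiline mode keeps only lines that are exactly "cat".
whole_lines = re.findall(r"^cat$", "cat\ncatalog\ncat", re.MULTILINE)

print(start_default, end_default)      # [0] [8]
print(start_multiline, end_multiline)  # [0, 8] [0, 8]
print(whole_lines)                     # ['cat', 'cat']
```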
Word Boundaries (\b)
The \b anchor matches word boundaries—positions between word and non-word characters.
Text: "cat category caterpillar"
Regex: \bcat\b
Matches: Only the standalone word "cat"
This is incredibly useful for finding whole words without matching partial words.
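As a rough illustration in Python, compare the same search with and without \b. The word "concat" is added to the sample text here to show that the boundary is checked on both sides.

```python
import re

text = "cat category caterpillar concat"

# Without boundaries, "cat" is found inside the longer words too.
anywhere = re.findall(r"cat", text)

# With \b on both sides, only the standalone word at offsets 0-3 matches.
whole_word_spans = [m.span() for m in re.finditer(r"\bcat\b", text)]

print(len(anywhere))     # 4
print(whole_word_spans)  # [(0, 3)]
```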
Escape Sequences
Special characters in regex need to be escaped with a backslash to match them literally:
Special characters: . * + ? ^ $ { } [ ] ( ) | \
Escaped forms: \. \* \+ \? \^ \$ \{ \} \[ \] \( \) \| \\
Example matching a price:
Regex: \$\d+\.\d{2}
Matches: "$19.99", "$5.00"
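The price pattern can be tried out in Python as shown below. The sketch also uses re.escape, a standard-library helper that adds the backslashes for you when you want to search for arbitrary literal text; the sample strings are assumptions made for the example.

```python
import re

# The escaped "$" and "." are literals; \d+ and \d{2} do the real matching.
price_pattern = re.compile(r"\$\d+\.\d{2}")
prices = price_pattern.findall("Items: $19.99, $5.00, and 7.50 (no dollar sign)")

# re.escape() escapes every metacharacter in a string, which is handy
# when the text to search for comes from user input.
literal = "1+1=2 (really?)"
found_literal = re.search(re.escape(literal), "We know 1+1=2 (really?) is true") is not None

# Unescaped, "+", "(", ")" and "?" are metacharacters, so the raw text fails as a pattern.
found_unescaped = re.search(literal, "We know 1+1=2 (really?) is true") is not None

print(prices)           # ['$19.99', '$5.00']
print(found_literal)    # True
print(found_unescaped)  # False
```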
Character Classes and Ranges
Character classes let you define a set of characters and match any one of them. They're enclosed in square brackets.
Basic Character Classes
Square brackets [] create a character set that matches any single character inside.
Text: "cat", "cot", "cut", "cit"
Regex: c[aou]t
Matches: "cat", "cot", "cut" (not "cit")
Character Ranges
Use hyphens to define ranges of characters:
- [a-z]: Any lowercase letter
- [A-Z]: Any uppercase letter
- [0-9]: Any digit
- [a-zA-Z]: Any letter (upper or lower)
- [a-z0-9]: Any lowercase letter or digit
Text: "a1", "b2", "c3", "d4"
Regex: [a-c][1-3]
Matches: "a1", "b2", "c3" (not "d4")
Negated Character Classes
Use a caret ^ at the start of a character class to negate it—matching any character NOT in the set.
Regex: [^0-9]
Matches: Any character that is NOT a digit
Text: "abc123def"
Regex: [^a-z]+
Matches: "123" (the sequence of non-lowercase letters)
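The three kinds of character class shown so far (sets, ranges, and negated sets) can be verified with a short Python sketch that reuses the sample strings from the examples above.

```python
import re

# A set: the middle letter must be a, o, or u.
words = ["cat", "cot", "cut", "cit"]
set_matches = [w for w in words if re.fullmatch(r"c[aou]t", w)]

# Two ranges: a letter from a to c followed by a digit from 1 to 3.
codes = ["a1", "b2", "c3", "d4"]
range_matches = [c for c in codes if re.fullmatch(r"[a-c][1-3]", c)]

# A negated set: runs of characters that are NOT lowercase letters.
non_lowercase_runs = re.findall(r"[^a-z]+", "abc123def")

print(set_matches)         # ['cat', 'cot', 'cut']
print(range_matches)       # ['a1', 'b2', 'c3']
print(non_lowercase_runs)  # ['123']
```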
Predefined Character Classes
Regex provides shorthand for common character classes:
| Shorthand | Equivalent | Description |
|---|---|---|
| `\d` | `[0-9]` | Any digit |
| `\D` | `[^0-9]` | Any non-digit |
| `\w` | `[a-zA-Z0-9_]` | Any word character |
| `\W` | `[^a-zA-Z0-9_]` | Any non-word character |
| `\s` | `[ \t\n\r\f\v]` | Any whitespace character |
| `\S` | `[^ \t\n\r\f\v]` | Any non-whitespace character |
Example matching a simple phone number:
Regex: \d{3}-\d{3}-\d{4}
Matches: "555-123-4567"
Quick tip: Uppercase versions of shorthand classes are always the negation of their lowercase counterparts. \d matches digits, \D matches non-digits.
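One caveat worth knowing: the "Equivalent" column describes ASCII behavior, and some engines widen the shorthands to Unicode. The Python sketch below, with made-up phone numbers, shows the phone pattern in use and then how Python 3's \d accepts a non-ASCII digit unless the re.ASCII flag is set. Other languages may behave differently.

```python
import re

phone = re.compile(r"\d{3}-\d{3}-\d{4}")
found = phone.findall("Office: 555-123-4567, fax: 555-765-4321")

# U+0663 is ARABIC-INDIC DIGIT THREE, a digit outside the ASCII range.
arabic_three = "\u0663"

# For str patterns in Python 3, \d matches any Unicode decimal digit...
matches_unicode = re.fullmatch(r"\d", arabic_three) is not None
# ...while re.ASCII restricts \d to [0-9].
matches_ascii_only = re.fullmatch(r"\d", arabic_three, flags=re.ASCII) is not None

print(found)               # ['555-123-4567', '555-765-4321']
print(matches_unicode)     # True
print(matches_ascii_only)  # False
```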
Quantifiers: Controlling Match Repetition
Quantifiers specify how many times a character or group should be matched. They're placed after the element you want to repeat.
Basic Quantifiers
- *: Zero or more times
- +: One or more times
- ?: Zero or one time (makes something optional)
- {n}: Exactly n times
- {n,}: At least n times
- {n,m}: Between n and m times
Examples of Quantifiers in Action
Asterisk (*) - Zero or More:
Regex: ca*t
Matches: "ct", "cat", "caat", "caaat"
Plus (+) - One or More:
Regex: ca+t
Matches: "cat", "caat", "caaat" (not "ct")
Question Mark (?) - Optional:
Regex: colou?r
Matches: "color" and "colour"
Exact Count {n}:
Regex: \d{3}
Matches: Exactly three digits like "123"
Range {n,m}:
Regex: \d{2,4}
Matches: 2 to 4 digits like "12", "123", or "1234"
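The quantifier examples above can be reproduced with a few lines of Python. The last line is an extra illustration, not from the examples above: it shows that {2,4} takes as many digits as it can, so a five-digit run yields a four-digit match and leaves one digit over.

```python
import re

samples = ["ct", "cat", "caat", "caaat"]
star = [s for s in samples if re.fullmatch(r"ca*t", s)]  # zero or more "a"
plus = [s for s in samples if re.fullmatch(r"ca+t", s)]  # one or more "a"

spellings = ["color", "colour", "colouur"]
optional = [s for s in spellings if re.fullmatch(r"colou?r", s)]  # at most one "u"

# Runs shorter than 2 digits are skipped; longer runs are cut at 4.
digit_runs = re.findall(r"\d{2,4}", "1 12 123 1234 12345")

print(star)        # ['ct', 'cat', 'caat', 'caaat']
print(plus)        # ['cat', 'caat', 'caaat']
print(optional)    # ['color', 'colour']
print(digit_runs)  # ['12', '123', '1234', '1234']
```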
Greedy vs. Lazy Quantifiers
By default, quantifiers are greedy—they match as much text as possible. Adding ? after a quantifier makes it lazy (matching as little as possible).
Text: "<div>content</div><div>more</div>"
Regex (greedy): <div>.*</div>
Matches: "<div>content</div><div>more</div>" (entire string)
Regex (lazy): <div>.*?</div>
Matches: "<div>content</div>" (first tag only)
Lazy quantifiers:
- *?: Zero or more (lazy)
- +?: One or more (lazy)
- ??: Zero or one (lazy)
- {n,m}?: Between n and m (lazy)
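Pro tip: Greedy and lazy quantifiers can both backtrack heavily on large inputs, so neither is automatically faster. Use a lazy quantifier when you need the shortest possible match, and keep in mind that it still cannot balance nested structures such as nested <div> tags. A negated character class like [^<]* is often a more predictable choice.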
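To see greedy and lazy matching side by side, here is a small Python sketch. The nested example at the end is an added assumption, included to show the limit of lazy matching: it stops at the first closing tag it finds, which is the wrong one when tags are nested.

```python
import re

html = "<div>content</div><div>more</div>"

# Greedy: .* runs to the LAST </div>, so everything is one match.
greedy = re.findall(r"<div>.*</div>", html)

# Lazy: .*? stops at the FIRST </div>, giving one match per element.
lazy = re.findall(r"<div>.*?</div>", html)

# Lazy matching does not understand nesting: the inner closing tag
# ends the match, leaving "c</div>" behind.
nested = "<div>a<div>b</div>c</div>"
lazy_nested = re.findall(r"<div>.*?</div>", nested)

print(greedy)       # ['<div>content</div><div>more</div>']
print(lazy)         # ['<div>content</div>', '<div>more</div>']
print(lazy_nested)  # ['<div>a<div>b</div>']
```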
Practical Example: Matching HTML Tags
Regex: <([a-z]+)>.*?</\1>
Matches: Paired HTML tags like "<p>text</p>" or "<div>content</div>"
This pattern uses lazy matching to avoid capturing multiple tags at once, and backreferences (covered next) to ensure opening and closing tags match.
Groups and Capturing
Parentheses () create groups that serve multiple purposes: they group parts of a pattern together, capture matched text for later use, and enable backreferences.
Basic Grouping
Groups let you apply quantifiers to multiple characters:
Regex: (ha)+
Matches: "ha", "haha", "hahaha"
Without grouping, ha+ would match "ha", "haa", "haaa" (only the 'a' repeats).
Capturing Groups
Groups automatically capture the text they match, which you can reference later:
Text: "John Smith"
Regex: (\w+) (\w+)
Captures: Group 1 = "John", Group 2 = "Smith"
In most programming languages, you can access these captures:
// JavaScript example
const match = "John Smith".match(/(\w+) (\w+)/);
console.log(match[1]); // "John"
console.log(match[2]); // "Smith"
Backreferences
Backreferences let you match the same text that was captured by a group earlier in the pattern. Use \1, \2, etc.
Regex: (\w+) \1
Matches: Repeated words like "the the" or "is is"
Regex: <([a-z]+)>.*?</\1>
Matches: Matching HTML tags like "<div>...</div>"
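Backreferences are easy to experiment with in Python. One detail the sketch below brings out: without word boundaries, (\w+) \1 can match across word edges, for example the "is is" hidden inside "This is". Adding \b on both sides is one possible fix; the sample sentences are invented.

```python
import re

sentence = "This is a test"

# "Th[is is] a test": the pattern matches inside "This", a false positive.
unanchored = re.findall(r"(\w+) \1", sentence)

# Word boundaries force both copies to be whole words.
anchored = re.findall(r"\b(\w+) \1\b", sentence)
real_repeats = re.findall(r"\b(\w+) \1\b", "Fix the the typo that is is here")

# \1 makes the closing tag name equal the opening tag name.
tag_pattern = re.compile(r"<([a-z]+)>.*?</\1>")
tags = [m.group(0) for m in tag_pattern.finditer("<p>text</p><div>content</div>")]

print(unanchored)    # ['is']
print(anchored)      # []
print(real_repeats)  # ['the', 'is']
print(tags)          # ['<p>text</p>', '<div>content</div>']
```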
Non-Capturing Groups
Sometimes you need grouping without capturing. Use (?:...) for non-capturing groups:
Regex: (?:https?|ftp)://\S+
Matches: URLs starting with http, https, or ftp
(The protocol isn't captured as a group)
Non-capturing groups keep group numbering uncluttered and can save a little overhead when you don't need to reference the captured text.
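In Python the choice between capturing and non-capturing groups has a visible side effect: re.findall returns the group contents when a pattern has a capturing group, and the whole match otherwise. A minimal sketch, using made-up URLs:

```python
import re

text = "Mirrors: https://example.com/a ftp://example.org/b"

# With a capturing group, findall returns only what the group captured.
capturing = re.findall(r"(https?|ftp)://\S+", text)

# With a non-capturing group, findall returns the full matches.
non_capturing = re.findall(r"(?:https?|ftp)://\S+", text)

print(capturing)      # ['https', 'ftp']
print(non_capturing)  # ['https://example.com/a', 'ftp://example.org/b']
```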
Named Capturing Groups
Named groups make your regex more readable and maintainable:
Regex: (?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})
Matches: Dates like "2026-03-31"
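Access (JavaScript): match.groups.year, match.groups.month, match.groups.day
Note: The (?<name>...) syntax shown here works in JavaScript, .NET, Java, and PCRE. Python's re module spells it (?P<name>...) instead.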
Named groups are especially useful in complex patterns where numbered references become confusing.
Quick tip: Use named groups for complex patterns that you or others will need to maintain. The slight verbosity pays off in readability.
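For comparison, here is the same date pattern as a Python sketch. Python's re module writes named groups as (?P<name>...), and the captured values are read with group() or groupdict(). The surrounding sentence is invented for the example.

```python
import re

# Python uses (?P<name>...) rather than (?<name>...).
date_pattern = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})")

match = date_pattern.search("Released on 2026-03-31.")

year = match.group("year")  # access one group by name
parts = match.groupdict()   # or get all named groups as a dict

print(year)   # 2026
print(parts)  # {'year': '2026', 'month': '03', 'day': '31'}
```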
Alternation and Choice Operators
The pipe symbol | acts as an OR operator, allowing you to match one pattern or another.
Basic Alternation
Regex: cat|dog
Matches: Either "cat" or "dog"
Regex: gray|grey
Matches: Both American and British spellings
Alternation with Groups
Combine alternation with groups for more complex patterns:
Regex: (Mr|Ms|Mrs|Dr)\. \w+
Matches: "Mr. Smith", "Dr. Jones", "Ms. Williams"
Regex: \.(jpg|jpeg|png|gif)$
Matches: Image file extensions at the end of filenames
Order Matters
The regex engine tries alternatives from left to right and stops at the first match:
Text: "category"
Regex: cat|category
Matches: "cat" (stops after first match)
Better: category|cat
Matches: "category" (longer match first)
Always put longer or more specific alternatives first to avoid premature matching.
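The effect of ordering can be confirmed with a quick Python check. The third pattern is an alternative worth considering rather than a rule: anchoring the alternation with \b forces the engine to backtrack into the longer branch, so the order stops mattering for whole-word matches.

```python
import re

# The first alternative that succeeds wins, even if a later one is longer.
short_first = re.search(r"cat|category", "category").group()
long_first = re.search(r"category|cat", "category").group()

# With word boundaries, "cat" fails at the trailing \b (the next character
# is "e"), so the engine falls back to "category".
bounded = re.search(r"\b(?:cat|category)\b", "category").group()

print(short_first)  # cat
print(long_first)   # category
print(bounded)      # category
```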
Practical Examples
Matching multiple date formats:
Regex: \d{4}-\d{2}-\d{2}|\d{2}/\d{2}/\d{4}
Matches: "2026-03-31" or "03/31/2026"
Matching different phone formats:
Regex: \d{3}-\d{3}-\d{4}|\(\d{3}\) \d{3}-\d{4}
Matches: "555-123-4567" or "(555) 123-4567"
Lookahead and Lookbehind Assertions
Lookaround assertions check if a pattern exists before or after the current position without including it in the match. They're zero-width assertions—they don't consume characters.
Positive Lookahead (?=...)
Matches a position where the pattern inside the lookahead follows:
Regex: \d+(?= dollars)
Text: "50 dollars and 30 euros"
Matches: "50" (only the number before "dollars")
The lookahead checks for " dollars" but doesn't include it in the match.
Negative Lookahead (?!...)
Matches a position where the pattern inside does NOT follow:
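Regex: \d+\b(?! dollars)
Text: "50 dollars and 30 euros"
Matches: "30" (the number NOT followed by " dollars")
The \b matters here: without it, the engine backtracks and also matches the "5" in "50", because that "5" is followed by "0" rather than " dollars".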
Positive Lookbehind (?<=...)
Matches a position where the pattern inside precedes:
Regex: (?<=\$)\d+
Text: "Price: $50 and €30"
Matches: "50" (only the number after "$")
Negative Lookbehind (?<!...)
Matches a position where the pattern inside does NOT precede:
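Regex: (?<![$\d])\d+
Text: "Price: $50 and €30"
Matches: "30" (the number NOT preceded by "$")
Including \d in the lookbehind matters: (?<!\$)\d+ on its own would also match the "0" in "$50", because that "0" is preceded by "5", not "$".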
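Negative lookarounds interact with backtracking in ways that are easy to miss, so it helps to run them. The Python sketch below tests the four assertion types on the sample texts from this section. The guarded variants (an added \b, and \d inside the lookbehind) are one way, not the only way, to stop the engine from settling for part of a number.

```python
import re

text = "50 dollars and 30 euros"

positive_ahead = re.findall(r"\d+(?= dollars)", text)
# Naive negative lookahead: the engine gives back the "0" and accepts "5".
naive_ahead = re.findall(r"\d+(?! dollars)", text)
# \b forces the match to cover the whole number.
guarded_ahead = re.findall(r"\d+\b(?! dollars)", text)

prices = "Price: $50 and €30"

positive_behind = re.findall(r"(?<=\$)\d+", prices)
# Naive negative lookbehind: the "0" in "$50" is preceded by "5", not "$".
naive_behind = re.findall(r"(?<!\$)\d+", prices)
# Also rejecting a preceding digit prevents starting mid-number.
guarded_behind = re.findall(r"(?<![$\d])\d+", prices)

print(positive_ahead, naive_ahead, guarded_ahead)     # ['50'] ['5', '30'] ['30']
print(positive_behind, naive_behind, guarded_behind)  # ['50'] ['0', '30'] ['30']
```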
Practical Applications
Password Validation: Ensure a password contains at least one uppercase letter, one lowercase letter, and one digit:
Regex: ^(?=.*[a-z])(?=.*[A-Z])(?=.*\d).{8,}$
Validates: Passwords with all requirements met
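A short Python sketch shows how the stacked lookaheads behave. Each lookahead scans from the start of the string for one requirement, and .{8,} then enforces the length. The candidate passwords and the helper name is_strong are made up for the example.

```python
import re

strong = re.compile(r"^(?=.*[a-z])(?=.*[A-Z])(?=.*\d).{8,}$")

def is_strong(password: str) -> bool:
    """Return True when every lookahead and the length check pass."""
    return strong.search(password) is not None

results = {
    "Passw0rdOK": is_strong("Passw0rdOK"),  # meets every requirement
    "password1": is_strong("password1"),    # no uppercase letter
    "PASSWORD1": is_strong("PASSWORD1"),    # no lowercase letter
    "Passwordd": is_strong("Passwordd"),    # no digit
    "Pw0rd": is_strong("Pw0rd"),            # shorter than 8 characters
}

for candidate, ok in results.items():
    print(candidate, ok)
```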
Extract Domain from Email:
Regex: (?<=@)[a-zA-Z0-9.-]+
Text: "user@example.com"
Matches: "example.com"
Find Numbers Not in Parentheses:
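Regex: (?<![(\d])\d+(?![\d)])
Text: "Call (555) 123-4567"
Matches: "123" and "4567" (not "555")
The \d inside each lookaround keeps the engine from matching part of a number: the simpler (?<!\()\d+(?!\)) would also match the middle "5" of "555", which is neither preceded by "(" nor followed by ")".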
Pro tip: Lookaround assertions are powerful but can impact performance. Use them judiciously, especially in patterns that will process large amounts of text. Some regex engines have limited lookbehind support.
Common Regex Patterns Reference Table
Here's a comprehensive reference of frequently used regex patterns for common validation and extraction tasks. You can test these patterns using our Regex Tester Tool.
| Pattern Type | Regex | Description |
|---|---|---|
| Email Address | `[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}` | Basic email validation |
| URL | `https?://[^\s]+` | Simple URL matching |
| Phone (US) | `\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}` | Various US phone formats |
| ZIP Code (US) | `\d{5}(-\d{4})?` | 5-digit or ZIP+4 |
| IP Address (IPv4) | `\b(?:\d{1,3}\.){3}\d{1,3}\b` | Basic IPv4 format (octet values are not range-checked) |
| Date (YYYY-MM-DD) | `\d{4}-\d{2}-\d{2}` | ISO date format (month and day values are not range-checked) |
| Time (24-hour) | `[0-2]\d:[0-5]\d` | HH:MM format (loose: also accepts hours 24-29) |
| Hex Color | `#[0-9A-Fa-f]{6}\b` | 6-digit hex colors |
| Credit Card | `\d{4}[-\s]?\d{4}[-\s]?\d{4}[-\s]?\d{4}` | 16-digit card numbers |
| Username | `^[a-zA-Z0-9_]{3,16}$` | 3-16 letters, digits, or underscores |
| Strong Password | `^(?=.*[a-z])(?=.*[A-Z])(?=.*\d)(?=.*[@$!%*?&])[A-Za-z\d@$!%*?&]{8,}$` | Min 8 chars with requirements |
| HTML Tag | `<([a-z]+)([^<]+)*(?:>(.*?)<\/\1>\|\s+\/>)` | Opening and closing tags |
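Several of these patterns check format only, not meaning. The Python sketch below copies four of them from the table and runs them with re.fullmatch. The helper names is_valid and octets_in_range are hypothetical, and the range check is just one simple way to tighten the IPv4 pattern.

```python
import re

PATTERNS = {
    "email": r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}",
    "zip": r"\d{5}(-\d{4})?",
    "ipv4": r"\b(?:\d{1,3}\.){3}\d{1,3}\b",
    "time": r"[0-2]\d:[0-5]\d",
}

def is_valid(kind: str, value: str) -> bool:
    """Format check: the whole value must match the named pattern."""
    return re.fullmatch(PATTERNS[kind], value) is not None

def octets_in_range(value: str) -> bool:
    """Extra semantic check for a string already in IPv4 format."""
    return all(0 <= int(part) <= 255 for part in value.split("."))

print(is_valid("email", "user@example.com"))  # True
print(is_valid("zip", "12345-6789"))          # True
print(is_valid("ipv4", "999.999.999.999"))    # True: the format is fine...
print(octets_in_range("999.999.999.999"))     # False: ...but the values are not
print(is_valid("time", "29:59"))              # True: the hour is not range-checked
```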
Advanced Pattern Examples
| Use Case | Regex | Example Match |
|---|---|---|
| Extract hashtags | `#\w+` | #regex #tutorial |
| Extract mentions | `@\w+` | @username |
| File extension | `\.\w+$` | .txt, .pdf, .jpg |
| Remove extra spaces | `\s+` | Multiple spaces |
| Markdown links | `\[([^\]]+)\]\(([^)]+)\)` | [text](url) |
| CSS class names | `class="([^"]*)"` | class="container" |
| Extract numbers | `-?\d+\.?\d*` | 123, -45.67 |
| Trim whitespace | `^\s+\|\s+$` | Leading/trailing spaces |
You can also use our String Formatter Tool to clean and format text data after extraction.
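A few of the patterns above are meant for extraction and cleanup rather than validation. As a rough Python sketch with invented sample strings, re.sub handles the cleanup patterns and re.findall handles the extraction patterns.

```python
import re

messy = "  too    many   spaces  "

# Trim whitespace: remove leading and trailing runs.
trimmed = re.sub(r"^\s+|\s+$", "", messy)
# Remove extra spaces: collapse each remaining run to a single space.
collapsed = re.sub(r"\s+", " ", trimmed)

# Markdown links: two capturing groups give (text, url) pairs.
markdown = "See [docs](https://example.com/docs) and [home](https://example.com)."
links = re.findall(r"\[([^\]]+)\]\(([^)]+)\)", markdown)

hashtags = re.findall(r"#\w+", "Learning #regex with this #tutorial")

print(repr(collapsed))  # 'too many spaces'
print(links)            # [('docs', 'https://example.com/docs'), ('home', 'https://example.com')]
print(hashtags)         # ['#regex', '#tutorial']
```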
Using Regex Across Different Languages
While regex syntax is largely consistent across languages, implementation details vary. Here's how to use regex in popular programming languages.
JavaScript
// Literal notation
const regex = /\d{3}-\d{3}-\d{4}/;
// Constructor notation
const regex2 = new RegExp('\\d{3}-\\d{3}-\\d{4}');
// Test if pattern matches
regex.test('555-123-4567'); // true
// Extract matches
const match = '555-123-4567'.match(regex);
// Replace
const result = 'Call 555-123-4567'.replace(regex, 'XXX-XXX-XXXX');
// Global flag for multiple matches
const globalRegex = /\d+/g;
'a1b2c3'.match(globalRegex); // ['1', '2', '3']
Python
import re
# Compile pattern
pattern = re.compile(r'\d{3}-\d{3}-\d{4}')
# Test if pattern matches
pattern.search('555-123-4567') # Match object or None
# Find all matches
re.findall(r'\d+', 'a1b2c3') # ['1', '2', '3']
# Replace
re.sub(r'\d{3}-\d{3}-\d{4}', 'XXX-XXX-XXXX', 'Call 555-123-4567')
# Split by pattern
re.split(r'\s+', 'split by spaces') # ['split