Most languages make regex easy. Python gives you re.compile. JavaScript gives you /pattern/flags. C gives you POSIX headers, raw structs, and the full weight of manual memory management.
That’s not a complaint. Working close to the metal clarifies what regex actually is — compiled finite automata operating on byte sequences.
The POSIX Regex API
POSIX regex in C lives in <regex.h>. Two functions do the real work, regcomp and regexec; regerror and regfree handle error reporting and cleanup:
#include <stdio.h>
#include <regex.h>
#include <stdlib.h>

int match(const char *pattern, const char *string) {
    regex_t regex;
    int result;

    /* Compile the pattern */
    result = regcomp(&regex, pattern, REG_EXTENDED);
    if (result != 0) {
        char errbuf[128];
        regerror(result, &regex, errbuf, sizeof(errbuf));
        fprintf(stderr, "regcomp error: %s\n", errbuf);
        return -1;
    }

    /* Execute the match */
    result = regexec(&regex, string, 0, NULL, 0);

    /* Always free the compiled regex */
    regfree(&regex);
    return result == 0 ? 1 : 0;
}

int main(void) {
    printf("%d\n", match("^[0-9]+$", "12345")); /* 1 */
    printf("%d\n", match("^[0-9]+$", "123ab")); /* 0 */
    return 0;
}
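The fixed 128-byte errbuf above is fine for a demo, but regerror also returns the buffer size a message needs, and when the size argument is 0 it only reports that size. The sketch below uses that to allocate an exact-fit message. The helper names regex_error_message and compile_pattern are hypothetical, introduced here for illustration only.

```c
#include <assert.h>
#include <regex.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: returns a malloc'd copy of the error message for
 * errcode, or NULL if allocation fails. The caller frees the result. */
char *regex_error_message(int errcode, const regex_t *preg) {
    /* With a zero-length buffer, regerror ignores the buffer and only
     * reports the size needed, including the terminating NUL. */
    size_t needed = regerror(errcode, preg, NULL, 0);
    char *msg = malloc(needed);
    if (msg != NULL) {
        regerror(errcode, preg, msg, needed);
    }
    return msg;
}

/* Hypothetical wrapper: compiles pattern as an extended regex.
 * Returns 0 on success. On failure it returns the regcomp error code and,
 * if errmsg is non-NULL, stores a malloc'd message in *errmsg. */
int compile_pattern(regex_t *out, const char *pattern, char **errmsg) {
    assert(out != NULL && pattern != NULL);
    int rc = regcomp(out, pattern, REG_EXTENDED);
    if (rc != 0 && errmsg != NULL) {
        *errmsg = regex_error_message(rc, out);
    }
    return rc;
}
```

On success the caller owns the compiled regex_t and must still call regfree on it; on failure the caller owns only the message string.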
Capturing Groups
To capture substrings you need regmatch_t — an array of match structs holding start/end offsets:
#include <stdio.h>
#include <regex.h>

void extract_groups(const char *pattern, const char *string, size_t nmatch) {
    regex_t regex;
    regmatch_t matches[nmatch]; /* VLA: nmatch must be at least 1 */

    if (regcomp(&regex, pattern, REG_EXTENDED) != 0) {
        fprintf(stderr, "Invalid pattern\n");
        return;
    }
    if (regexec(&regex, string, nmatch, matches, 0) == 0) {
        for (size_t i = 0; i < nmatch; i++) {
            /* A group that did not participate in the match has rm_so == -1;
               later groups may still be set, so skip rather than stop */
            if (matches[i].rm_so == -1) continue;
            /* rm_so / rm_eo are byte offsets into the input string */
            int len = (int)(matches[i].rm_eo - matches[i].rm_so);
            printf("Group %zu: %.*s\n", i, len, string + matches[i].rm_so);
        }
    }
    regfree(&regex);
}

int main(void) {
    /* Extract year, month, day from an ISO date */
    extract_groups(
        "([0-9]{4})-([0-9]{2})-([0-9]{2})",
        "Today is 2016-05-06 and tomorrow is 2016-05-07.",
        4 /* full match + 3 groups */
    );
    return 0;
}
Output:
Group 0: 2016-05-06
Group 1: 2016
Group 2: 05
Group 3: 06
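Note that only the first date is reported: regexec stops at the first match. POSIX has no "find all" call, so one common approach is to call regexec again on the remainder of the string. The sketch below does that, passing REG_NOTBOL on later calls so that ^ does not match in the middle of the input. The function name print_all_matches is hypothetical.

```c
#include <assert.h>
#include <regex.h>
#include <stdio.h>

/* Hypothetical helper: prints every non-overlapping match of pattern in
 * text and returns how many were found, or -1 if the pattern is invalid. */
int print_all_matches(const char *pattern, const char *text) {
    assert(pattern != NULL && text != NULL);
    regex_t regex;
    if (regcomp(&regex, pattern, REG_EXTENDED) != 0) {
        return -1;
    }

    int count = 0;
    int eflags = 0;
    const char *cursor = text;
    regmatch_t m;

    while (regexec(&regex, cursor, 1, &m, eflags) == 0) {
        /* Offsets are relative to cursor, so add the distance already consumed */
        long start = (long)(cursor - text) + (long)m.rm_so;
        int len = (int)(m.rm_eo - m.rm_so);
        printf("Match %d at byte %ld: %.*s\n", count, start, len, cursor + m.rm_so);
        count++;

        if (m.rm_eo == m.rm_so) {
            /* Empty match: step one byte forward to avoid looping forever */
            if (cursor[m.rm_eo] == '\0') break;
            cursor += m.rm_eo + 1;
        } else {
            cursor += m.rm_eo;
        }
        eflags = REG_NOTBOL; /* cursor is no longer the start of the input */
    }

    regfree(&regex);
    return count;
}
```

Compiling once and calling regexec repeatedly is the point of the two-step API: the compile cost is paid a single time no matter how many matches the input contains.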
What Can Go Wrong
Forgetting regfree. Each regcomp allocates internal state. Missing regfree leaks memory — silently, in long-running processes.
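Stack-allocating large regmatch_t arrays. A variable-length array sized by a caller-supplied nmatch can overflow the stack. For patterns with many groups, allocate on the heap; after a successful regcomp, the re_nsub member of regex_t holds the number of groups, so re_nsub + 1 entries cover the full match plus every group.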
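REG_EXTENDED vs basic regex. Without REG_EXTENDED you get basic regular expressions (BRE): +, ?, and | are ordinary characters (some implementations, glibc among them, accept \+, \?, and \| as extensions), and grouping is written \( \) instead of ( ). Always use REG_EXTENDED unless you have a specific reason not to.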
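Thread safety. regcomp, regexec, and regfree are thread-safe as functions, but that does not cover racing on a single object: never call regcomp or regfree on a regex_t while another thread is using it. regexec takes the compiled pattern as a const pointer, and glibc, for one, takes an internal lock around each regexec call, so concurrent matching on a shared regex_t works there but is serialized. If you cannot confirm the behavior for your libc, give each thread its own compiled copy or guard the shared one with a mutex.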
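For the regmatch_t sizing pitfall above, one option is to let the compiled pattern say how many slots it needs. POSIX defines a re_nsub member on regex_t that holds the number of parenthesized groups after a successful regcomp. The sketch below sizes a heap array from it and pairs each successful regcomp with exactly one regfree on every return path. The function name match_groups is hypothetical.

```c
#include <assert.h>
#include <regex.h>
#include <stdlib.h>

/* Hypothetical helper: compiles pattern, matches it against text, and
 * returns a malloc'd array of re_nsub + 1 entries (slot 0 is the full
 * match). Returns NULL on compile failure, no match, or allocation
 * failure. On success *count receives the number of entries and the
 * caller frees the array. */
regmatch_t *match_groups(const char *pattern, const char *text, size_t *count) {
    assert(pattern != NULL && text != NULL && count != NULL);
    regex_t regex;
    if (regcomp(&regex, pattern, REG_EXTENDED) != 0) {
        return NULL; /* nothing was compiled, so no regfree here */
    }

    /* One slot for the whole match plus one per capture group */
    size_t n = regex.re_nsub + 1;
    regmatch_t *groups = malloc(n * sizeof *groups);
    if (groups != NULL && regexec(&regex, text, n, groups, 0) != 0) {
        free(groups);
        groups = NULL;
    }

    regfree(&regex); /* reached on every path after a successful regcomp */
    if (groups != NULL) {
        *count = n;
    }
    return groups;
}
```

The offsets in the returned array stay valid after regfree because they index into the caller's text, not into the compiled pattern.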
When C Regex Makes Sense
Mostly when you’re already writing C and need lightweight pattern matching without pulling in PCRE or another library. For anything complex, use a language with a richer regex API and garbage collection.
But understanding the C API clarifies regex semantics that higher-level wrappers hide. regmatch_t.rm_so and rm_eo are byte offsets, not character indices — a distinction that matters the moment your input contains multibyte UTF-8 sequences.
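To make the byte-versus-character distinction concrete, the sketch below converts a byte offset such as rm_so into a code point index by counting every byte that is not a UTF-8 continuation byte. It assumes the input is valid UTF-8 and uses an ASCII-only pattern so the result does not depend on the locale. Both function names are hypothetical.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: counts the code points that start before
 * byte_offset in s. Assumes s is valid UTF-8. */
size_t utf8_char_index(const char *s, size_t byte_offset) {
    assert(s != NULL);
    size_t chars = 0;
    for (size_t i = 0; i < byte_offset && s[i] != '\0'; i++) {
        /* Continuation bytes look like 10xxxxxx; any other byte starts
         * a new code point */
        if (((unsigned char)s[i] & 0xC0) != 0x80) {
            chars++;
        }
    }
    return chars;
}

/* Hypothetical helper: returns the code point index of the first match
 * of pattern in text, or -1 if there is no match or the pattern is invalid. */
long first_match_char_index(const char *pattern, const char *text) {
    assert(pattern != NULL && text != NULL);
    regex_t regex;
    regmatch_t m;
    long index = -1;

    if (regcomp(&regex, pattern, REG_EXTENDED) != 0) {
        return -1;
    }
    if (regexec(&regex, text, 1, &m, 0) == 0) {
        index = (long)utf8_char_index(text, (size_t)m.rm_so);
    }
    regfree(&regex);
    return index;
}
```

For the input "caf\xC3\xA9 2016" (the word "café" followed by a year) and the pattern [0-9]+, rm_so is 6 because é occupies two bytes, while the character index returned here is 5.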