JPCRE2 10.32.01
C++ wrapper for PCRE2 library
PCRE2 is the name used for a revised API for the PCRE library, which is a set of functions, written in C, that implement regular expression pattern matching using the same syntax and semantics as Perl, with just a few differences. Some features that appeared in Python and the original PCRE before they appeared in Perl are also available using the Python syntax.
JPCRE2 provides C++ wrapper classes and functions for regex operations such as regex match and regex replace.
You can read the complete documentation here or download it from the jpcre2-doc repository.
Requires the PCRE2 library (version >=10.21). If the required PCRE2 version is not available in the official channel, you can download my fork of the library.
This is a header-only library. All you need to do is include the header jpcre2.hpp in your program.
Notes:
1. jpcre2.hpp #includes pcre2.h, thus you don't need to include pcre2.h manually in your program.
2. If pcre2.h is in a non-standard path then you may include it before jpcre2.hpp with the correct path (you will need to define PCRE2_CODE_UNIT_WIDTH before including pcre2.h in this case).
3. There is no need to define PCRE2_CODE_UNIT_WIDTH before including jpcre2.hpp.
4. If you link against a static PCRE2 library on Windows, define PCRE2_STATIC before including jpcre2.hpp (or before pcre2.h if you included it manually).

Install:
You can copy this header to a standard include directory (folder) so that it becomes available from a standard include path.
Download or clone the release branch unless you want the master (continuous dev) branch specifically:
On Unix you can do:
It will check if all dependencies are satisfied and install the header in a standard include path.
Compile/Build:
Compile/Build your code with the corresponding PCRE2 libraries linked. For 8-bit code unit width, link against the 8-bit library; for 16-bit, the 16-bit library; and so on. If you want to use multiple code unit widths, link against all of the 8-bit, 16-bit and 32-bit libraries. See code unit width and character type for details.
Example compilation with g++:
If PCRE2 is not installed in the standard path, add the path with the -L option:
Performing a match or replacement against a regex pattern involves two steps: compiling the pattern, then performing the match or replace operation against it.
Select a character type according to the library you want to use. In this doc we use the 8-bit library as reference, with char as the character type. If char on your system is 16-bit you will have to link against the 16-bit library instead; the same goes for 32-bit. Other bit sizes are not supported by PCRE2.
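Let's use a typedef to shorten the code; the rest of this document refers to it as jp (a typedef of jpcre2::select for the chosen character type):
(You can use temporary objects too, see short examples).
The Regex object holds the pattern, the options and the compiled pattern.
Use one object per regex pattern.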
Now you can perform match or replace against the pattern. Use the RegexMatch::match() member function to perform regex match and the RegexReplace::replace() member function to perform regex replace.
You can check whether the regex was compiled successfully, but it's not necessary. A match against a non-compiled regex gives 0 matches, and a replace returns the exact same subject string that you passed.
The if(re) conditional is only available for >= C++11:
For < C++11, you can use the double bang trick as an alternative to if(re):
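To make the double bang trick concrete, here is a minimal sketch. ToyRegex is a hypothetical stand-in, not the JPCRE2 class; it only assumes the object provides operator!, which is what the trick relies on.

```cpp
#include <cassert>

// Hypothetical stand-in for a regex object; this is NOT the JPCRE2 class.
// It exposes only operator!, the way a pre-C++11 class typically would.
class ToyRegex {
public:
    explicit ToyRegex(bool compiled) : compiled_(compiled) {}
    // Returns true when the pattern failed to compile.
    bool operator!() const { return !compiled_; }
private:
    bool compiled_;
};

// !!re applies operator! and then negates the resulting bool,
// so the result is true exactly when the object is usable.
inline bool is_usable(const ToyRegex& re) {
    return !!re;
}
```

With a real Regex object the same expression would sit directly in the condition, e.g. if(!!re) { ... }.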
Match is generally performed using the jp::RegexMatch::match() function.
For convenience, a shortcut function in Regex is available: jp::Regex::match(). It can take up to three arguments. It uses a temporary match object to perform the match.
To get match results, you will need to pass vector pointers that will be filled with match data.
The g modifier performs a global match.
To get the match results, you need to pass appropriate vector pointers. This is an example of how you can get the numbered substrings/captured groups from a match:
You can access a substring/captured group by specifying their index (position):
To get named substrings and/or the name to number mapping, pass pointers to the appropriate vectors with jp::RegexMatch::setNamedSubstringVector() and/or jp::RegexMatch::setNameToNumberMapVector() before doing the match.
If you need the name to number mapping (the position of each named group), you should have passed a jp::VecNtN pointer to the jp::RegexMatch::setNameToNumberMapVector() function before doing the match (see above).
You can iterate through the matches for numbered substrings (jp::VecNum) like this:
You can iterate through named substrings (jp::VecNas) like this:
If you are using >=C++11, you can make the loop a lot simpler:
jp::VecNtN can be iterated through the same way as jp::VecNas.
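The iteration patterns described above can be sketched without JPCRE2 installed by using stand-in typedefs. The typedefs below are assumptions that mirror the equivalences listed in this document (a vector of strings per match for numbered groups, a map per match for named groups); in real code you would use jp::VecNum and jp::VecNas as filled by a match. The function names are illustrative only.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Stand-ins mirroring the equivalences listed in this document;
// with JPCRE2 you would use jp::NumSub, jp::MapNas, jp::VecNum, jp::VecNas.
typedef std::vector<std::string>           NumSub;
typedef std::map<std::string, std::string> MapNas;
typedef std::vector<NumSub>                VecNum;
typedef std::vector<MapNas>                VecNas;

// Pre-C++11 style: index-based loops over matches and their groups.
std::string dump_numbered(const VecNum& vec_num) {
    std::ostringstream out;
    for (std::size_t i = 0; i < vec_num.size(); ++i) {
        for (std::size_t j = 0; j < vec_num[i].size(); ++j) {
            out << "match " << i << " group " << j << ": "
                << vec_num[i][j] << "\n";
        }
    }
    return out.str();
}

// C++11 style: range-based for loops over the named groups of each match.
std::string dump_named(const VecNas& vec_nas) {
    std::ostringstream out;
    for (const auto& match : vec_nas) {
        for (const auto& kv : match) {
            out << kv.first << " = " << kv.second << "\n";
        }
    }
    return out.str();
}
```

Because each match is stored as its own inner container, the outer index is the match number and the inner index (or key) identifies the group.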
Every match object needs to be associated with a Regex object. A match object without an associated Regex object will always give 0 matches.
The RegexMatch class stores a pointer to its associated Regex object. If the content of the associated Regex object is changed, the change will be reflected in the next operation/result.
Regex replace is generally performed using the jp::RegexReplace::replace() function.
However, a convenience shortcut function is available in the Regex class: jp::Regex::replace(subject, replacewith, modifier). It uses a temporary replace object to perform the replacement.
Every replace object needs to be associated with a Regex object. A replace object not associated with any Regex object will perform no replacement and return the same subject string that was given.
The RegexReplace class stores a pointer to its associated Regex object. If the content of the associated Regex object is changed, the change will be reflected in the next operation/result.
The jp::RegexReplace class has two replace functions: jp::RegexReplace::replace() and jp::RegexReplace::nreplace(). Both of them can take a jp::MatchEvaluator instance as argument and perform the replace operation according to the callback function set in the MatchEvaluator class.
Those two are just wrappers of jp::MatchEvaluator::replace() and jp::MatchEvaluator::nreplace(). Using these functions directly, one can re-use existing match data for a new replacement operation without doing the match again. This facility comes with some quirks, though; see the Re-use match data section. By default all replace functions do a new match every time and re-create the match data.
The first function mentioned above (replace()) is for PCRE2 compatible replacement: it uses pcre2_substitute to process the replacement string returned by the callback function. The second one (nreplace()) uses a native approach without pcre2_substitute and treats the string returned by the callback function as literal.
The class MatchEvaluator implements several constructor overloads to take different callback functions. There are also setter functions which allow changing the callback function if desired.
The callback function takes exactly three positional arguments. If you don't need one or more arguments, you may pass void* in their respective positions in the argument list.
The callback function:
then,
Detailed examples are in the testme.cpp file.
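As a rough illustration of the three-argument callback shape, the sketch below uses a stand-in NumSub typedef (assumed to mirror jp::NumSub as described in this document) and a toy driver, apply_to_all, that is hypothetical and not part of JPCRE2. It only shows how a callback that ignores its second and third arguments can declare them as void*.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Stand-in for jp::NumSub (std::vector<jp::String> per this document).
typedef std::vector<std::string> NumSub;

// A callback shaped like the ones described above: three positional
// parameters, with void* in the two positions this callback does not use.
// It wraps the whole match (group 0) in parentheses.
std::string wrap_in_parens(const NumSub& groups, void*, void*) {
    assert(!groups.empty());  // group 0 is the full match
    return "(" + groups[0] + ")";
}

// Toy driver (hypothetical, not part of JPCRE2): calls the callback once
// per match and concatenates the results to show how it would be invoked.
std::string apply_to_all(const std::vector<NumSub>& matches,
                         std::string (*cb)(const NumSub&, void*, void*)) {
    std::string out;
    for (std::size_t i = 0; i < matches.size(); ++i) {
        out += cb(matches[i], 0, 0);
    }
    return out;
}
```

In real code the callback would be handed to a jp::MatchEvaluator, which performs the matching and calls it for each match; testme.cpp has the complete versions.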
Replacement can be done with only MatchEvaluator:
A MatchEvaluator object can be created using one of its many constructors. Callback functions can be provided with the constructors or can be changed later with the jp::MatchEvaluator::setCallback() function. If no callback function is set/given, then the default callback function is jp::callback::erase(), which deletes the matched part(s) from the subject string.
It is possible to use existing match data to perform replacement without performing a new match operation.
Safest way but not the best:
Best but not the safest:
Instead of creating data for all vectors, you can do it as necessary, but it requires you to be vigilant about what you are doing:
Let's say we have a callback cb3 that implements NumSub and MapNas and we do this:
Now, if we want to perform the replacement with a different callback function cb2 which implements only MapNas or NumSub or both, we can re-use the data created above:
If we want to use a callback function cb4 which implements jp::MapNtN, we cannot re-use the existing data because there is no data for jp::MapNtN yet (it will give an assertion error if we try). Thus we will need to do the match again:
After the above operation, all the vectors are filled with data (the missing jp::MapNtN was created); consequently, we can use any callback function we want at this stage because we have all the data that we will need.
Thus a callback cb7 that implements all match data vectors can be used without doing the match again:
Quirks:
Make sure you understand at least points #3 and #4 above before re-using match data in practice. See jpcre2::select::MatchEvaluator for details.
JPCRE2 uses a default set of modifiers to provide an easy path to setting different options for different operations. There are three basic operations, namely compile, match and replace, and thus the set is divided into three subsets of modifiers. For convenience, we call them modifier tables.
If the default modifier table is not suitable for your application, you may use a custom modifier table instead. The jpcre2::ModifierTable class provides this interface (note the namespace, it's directly under jpcre2).
All modifier strings are parsed and converted to equivalent PCRE2 and JPCRE2 options on the fly. If you don't want to spend any time parsing modifiers, pass the equivalent option directly with one of the many variants of the addJpcre2Option() and addPcre2Option() functions.
Be careful when you pass these options. A common mistake is to pass compile related options such as PCRE2_CASELESS (modifier i) to a match operation; PCRE2_CASELESS needs to be compiled into the regex, so passing it during match will have no effect.
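To give a feel for what "parsing a modifier string into options" means, here is a toy sketch. The option names and bit values are hypothetical and are not those of PCRE2 or JPCRE2; the point is only that each character is looked up and OR-ed into an option word, which is the work you skip when you pass options directly.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical option bits for illustration only; these are NOT the
// values PCRE2 or JPCRE2 use.
enum ToyOption {
    TOY_CASELESS  = 1u << 0,  // stands in for the compile modifier 'i'
    TOY_MULTILINE = 1u << 1,  // stands in for the compile modifier 'm'
    TOY_DOTALL    = 1u << 2   // stands in for the compile modifier 's'
};

// Walks the modifier string and ORs together the matching option bits.
// Unknown characters are ignored here; a real table may report them instead.
unsigned parse_compile_modifiers(const std::string& mod) {
    unsigned opts = 0;
    for (std::size_t i = 0; i < mod.size(); ++i) {
        switch (mod[i]) {
            case 'i': opts |= TOY_CASELESS;  break;
            case 'm': opts |= TOY_MULTILINE; break;
            case 's': opts |= TOY_DOTALL;    break;
            default:  break;
        }
    }
    return opts;
}
```

In this toy model, passing the modifier string "im" and passing TOY_CASELESS | TOY_MULTILINE directly end in the same option word.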
Types of modifiers:
All of the modifiers above can be divided further into two categories:
These modifiers define the behavior of a regex pattern (they are integrated in the compiled regex). They have more or less the same meaning as the PHP regex modifiers except for e, j and n (marked with *).
Modifier | Details |
---|---|
e * | Unset back-references in the pattern will match to empty strings. Equivalent to PCRE2_MATCH_UNSET_BACKREF. |
i | Case-insensitive. Equivalent to PCRE2_CASELESS option. |
j * | \u \U \x and unset back-references will act as JavaScript standard. Equivalent to PCRE2_ALT_BSUX \| PCRE2_MATCH_UNSET_BACKREF. |
m | Multi-line regex. Equivalent to PCRE2_MULTILINE option. |
n * | Enable Unicode support for \w \d etc... in pattern. Equivalent to PCRE2_UTF \| PCRE2_UCP. |
s | If this modifier is set, a dot meta-character in the pattern matches all characters, including newlines. Equivalent to PCRE2_DOTALL option. |
u | Enable UTF support. Treat pattern and subjects as UTF strings. Equivalent to PCRE2_UTF option. |
x | Whitespace data characters in the pattern are totally ignored except when escaped or inside a character class; enables commentary in pattern. Equivalent to PCRE2_EXTENDED option. |
A | Match only at the first position. Equivalent to PCRE2_ANCHORED option. |
D | A dollar meta-character in the pattern matches only at the end of the subject string. Without this modifier, a dollar also matches immediately before the final character if it is a newline (but not before any other newlines). This modifier is ignored if m modifier is set. Equivalent to PCRE2_DOLLAR_ENDONLY option. |
J | Allow duplicate names for sub-patterns. Equivalent to PCRE2_DUPNAMES option. |
S | When a pattern is going to be used several times, it is worth spending more time analyzing it in order to speed up the time taken for matching/replacing. It may also be beneficial for a very long subject string or pattern. Equivalent to an extra compilation with the JIT compiler using the option PCRE2_JIT_COMPLETE. |
U | This modifier inverts the "greediness" of the quantifiers so that they are not greedy by default, but become greedy if followed by ?. Equivalent to PCRE2_UNGREEDY option. |
These modifiers are not compiled in the regex itself, rather they are used per call of each match or replace function.
Modifier | Action | Details |
---|---|---|
A | match | Match at start. Equivalent to PCRE2_ANCHORED. Can be used in match operation. Setting this option only at match time (i.e. the regex was not compiled with this option) will disable optimization during match time. |
e | replace | Replaces unset group with empty string. Equivalent to PCRE2_SUBSTITUTE_UNSET_EMPTY. |
E | replace | Extension of e modifier. Sets even unknown groups to empty string. Equivalent to PCRE2_SUBSTITUTE_UNSET_EMPTY \| PCRE2_SUBSTITUTE_UNKNOWN_UNSET. |
g | match replace | Global. Will perform global matching or replacement if passed. Equivalent to jpcre2::FIND_ALL for match and PCRE2_SUBSTITUTE_GLOBAL for replace. |
x | replace | Extended replacement operation. Equivalent to PCRE2_SUBSTITUTE_EXTENDED. It enables some Bash-like features: ${<n>:-<string>} and ${<n>:+<string1>:<string2>}. <n> may be a group number or a name. The first form specifies a default value. If group <n> is set, its value is inserted; if not, <string> is expanded and the result is inserted. The second form specifies strings that are expanded and inserted when group <n> is set or unset, respectively. The first form is just a convenient shorthand for ${<n>:+${<n>}:<string>}. |
A modifier table is an instance of the jpcre2::ModifierTable class. You can bind this table with any of the compile, match and replace related class objects. Different objects can have different tables.
Examples:
For details, see the testmd.cpp file.
JPCRE2 allows both PCRE2 and native JPCRE2 options to be passed. PCRE2 options are recognized by the PCRE2 library itself.
These options are meaningful only for the JPCRE2 library, not the original PCRE2 library. We use the jp::Regex::addJpcre2Option() family of functions to pass these options.
Option | Details |
---|---|
jpcre2::NONE | This is the default option. Equivalent to 0 (zero). |
jpcre2::FIND_ALL | This option will do a global match if passed during matching. The same can be achieved by passing the 'g' modifier with jp::RegexMatch::addModifier() function. |
jpcre2::JIT_COMPILE | This is same as passing the S modifier during pattern compilation. |
We use the jp::Regex::addPcre2Option() family of functions to pass the PCRE2 options. These options are the same as in the PCRE2 library and have the same meaning. For example, instead of passing the 'g' modifier to the replacement operation we can pass its PCRE2 equivalent PCRE2_SUBSTITUTE_GLOBAL to have the same effect. Passing these options directly will be faster than passing modifiers.
This is where deviations from the PCRE2 specification will be laid out.
Details | PCRE2 | JPCRE2 |
---|---|---|
Different name for same group | not supported (10.21) | supported (>=10.30.01) |
The bit size of the character type must match the PCRE2 library you are linking against. There are three PCRE2 libraries according to code unit width, namely the 8, 16 and 32 bit libraries. So, if you use a character type with an 8-bit code unit width (e.g. char, which is generally 8 bit), you will have to link your program against the 8-bit PCRE2 library. If it's a 16-bit character type, you will need the 16-bit library. If you use a combination of the supported code unit widths, or all of them, you will have to link your program against their corresponding PCRE2 libraries. A missing library will yield a compile time error.
Implementation defined behavior:
The size of integral types (char, wchar_t, char16_t, char32_t) is implementation defined. char may be 8, 16, 32 or 64 (not supported) bit. The same goes for wchar_t and others. On Linux wchar_t is 32 bit and on Windows it's 16 bit.
JPCRE2 code is portable with regard to code unit width. Your program gets compiled according to the code unit width defined by your system. Consider the following example, where you do:
This is what will happen when you compile:
1. If char is 8 bit, it will use the 8-bit library and UTF-8 in UTF-mode.
2. If char is 16 bit, it will use the 16-bit library and UTF-16 in UTF-mode.
3. If char is 32 bit, it will use the 32-bit library and UTF-32 in UTF-mode.
4. If char is not 8 or 16 or 32 bit, it will yield a compile error.

If you don't want to be so aware of the code unit width of the character type/s you are using, link your program against all PCRE2 libraries. The code unit width will be handled automatically and if anything unsupported is encountered, you will get a compile time error.
A common example in this regard can be the use of wchar_t:
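As a rough, library-independent illustration of why the character type decides which library is needed, the sketch below computes the code unit width of a type and maps it to the library/UTF pairing described above. The function names are hypothetical and this is not how JPCRE2 itself is implemented.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <string>

// Width of one code unit of Char_T in bits.
template <typename Char_T>
std::size_t code_unit_width() {
    return sizeof(Char_T) * CHAR_BIT;
}

// Which PCRE2 library (and UTF flavour) a given width corresponds to,
// following the mapping described above. Other widths are unsupported.
inline std::string library_for_width(std::size_t bits) {
    switch (bits) {
        case 8:  return "8-bit library, UTF-8";
        case 16: return "16-bit library, UTF-16";
        case 32: return "32-bit library, UTF-32";
        default: return "unsupported";
    }
}
```

Calling library_for_width(code_unit_width<wchar_t>()) would typically report the 32-bit library on Linux and the 16-bit library on Windows, which is why code using wchar_t needs different libraries linked on the two systems.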
For portable code, instead of using the standard names std::string or such, use jp::String (you may further typedef it as String or whatever). It will be defined to an appropriate string class according to the basic character type you selected and thus provide all the functionalities and conveniences you get with std::string and such string classes. That being said, there's no harm if you use the standard names (std::string etc.). Using jp::String just ensures that you are using the correct string class for the character type you selected. If you need to use the basic character type, use jp::Char.
Instead of using full names like std::vector<std::string> and such for storing match results, use the typedefs:
- jp::NumSub: Equivalent to std::vector<jp::String>
- jp::MapNas: Equivalent to std::map<jp::String, jp::String> (you can set an arbitrary map, e.g. std::unordered_map, instead of std::map when using >=C++11)
- jp::MapNtN: Equivalent to std::map<jp::String, size_t> (you can set an arbitrary map, e.g. std::unordered_map, instead of std::map when using >=C++11)
- jp::VecNum: Equivalent to std::vector<jp::NumSub>
- jp::VecNas: Equivalent to std::vector<jp::MapNas>
- jp::VecNtN: Equivalent to std::vector<jp::MapNtN>
- jpcre2::VecOff: Equivalent to std::vector<size_t> (note the namespace, it's directly under jpcre2)

Other typedefs are mostly for internal use:
- jpcre2::Ush is unsigned short. In JPCRE2 context, it is the smallest unsigned integer type to cover at least the numbers from 1 to 126.
- jpcre2::Uint is a fixed width unsigned integer type and will be at least 32 bit wide.
- jpcre2::SIZE_T is the same as PCRE2_SIZE which is defined as size_t.
- jpcre2::VecOpt is defined as std::vector<jpcre2::Uint>.

When a known error occurs during pattern compilation, match or replace, the error number and error offset are stored in the corresponding variables of the respective class. You can get the error number, error offset and error message with the getErrorNumber(), getErrorOffset() and getErrorMessage() functions respectively. These functions are available for all three classes.
Note that each new error overwrites the previous one, so you only get the last error that occurred.
Also note that these errors never get re-initialized (set to zero) on their own; a previous error stays there even when everything afterwards worked fine.
If you do experiment with various erroneous situations, make use of the resetErrors() function. You can call it from anywhere in your method chain and immediately set the errors to zero. This function is also defined for all three classes to reset their corresponding errors.
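The behaviour described above (the last error sticks until it is reset) can be modelled with a small stand-in class. ToyOp and run() are hypothetical and not JPCRE2 code; only the names resetErrors() and getErrorNumber() are borrowed from this document, to show where a reset fits into a method chain.

```cpp
#include <cassert>

// Hypothetical stand-in, not JPCRE2 code: it only models the behaviour
// described above (the last error sticks until it is reset explicitly).
class ToyOp {
public:
    ToyOp() : error_number_(0) {}

    // Pretend operation: a non-zero code means it failed.
    ToyOp& run(int code) {
        if (code != 0) error_number_ = code;  // a new error overwrites the old one
        return *this;                          // success leaves the old error in place
    }

    ToyOp& resetErrors() { error_number_ = 0; return *this; }
    int getErrorNumber() const { return error_number_; }

private:
    int error_number_;
};
```

In this model, op.run(101).run(0) still reports 101 afterwards, while op.resetErrors().run(0) reports 0, which is why a reset before the operation you care about makes the error state unambiguous.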
JPCRE2 asserts some errors with descriptive error messages. These errors are mistakes in your code and should not be shipped to the client without fixing.
Under no circumstances should these errors be bypassed by #define NDEBUG before including jpcre2.hpp. You should investigate the error message and fix the cause.
When there are no such errors in your finalized code, you may use #define NDEBUG to strip out these assertions.
JPCRE2 treats null as valid input and its usage has well-defined behavior throughout the JPCRE2 interface. Most of the time a null is treated as 'set something to its initial or empty state'. The initial state is not necessarily an empty state, and an empty state is not necessarily the initial state. It depends on what you are working with; refer to the doc when you are in a bind.
As an example, if null is passed with setSubject(), then the subject is set to its initial state which is empty (not null).
Another example is when a null is passed to the setRegexObject() function: it literally sets the Regex object to null, which is actually the initial state for that calling object.
Giving a null to the std::string (and such) constructor is undefined behavior. But you don't need to worry about it with JPCRE2: if it's too much to type two double quotes ("") to pass an empty string to a JPCRE2 function, you can just use 0, it's perfectly fine. It's a bad practice though, so treat this only as a safety measure.
Note: JPCRE2 is supposed to be completely null safe, i.e no undefined behavior for null input. So, if you find any loophole or bug that makes this statement invalid, please report it.
(C) MT safe: All functions in JPCRE2 library are MT safe provided that the instances calling those functions are themselves thread safe.
When we say '(C) MT safe' or simply 'thread safe' throughout this doc, we mean the above definition of Conditional Multi-Thread safety.
There is no data race between two separate objects (Regex, RegexMatch, RegexReplace etc.) because the classes do not contain any static variables.
Examples:
The following function is thread safe:
The following function is thread safe for joined thread only:
Example multi-threaded programs are provided in src/test_pthread.cpp and src/teststdthread.cpp. The thread safety of these programs is tested with Valgrind (drd tool). See Test suite for more details on the test.
1. To use the full functionality (including >=C++11 features), use the latest compilers with full C++11 support.
2. If you do not use >=C++11, you will be OK with older compilers.

Examples and test programs are available in src/test*.cpp files.
File | Containing examples |
---|---|
test0.cpp | Handling std::string and std::wstring. |
test16.cpp | Performing regex match and regex replace with std::wstring and std::u16string. |
test32.cpp | Performing regex match and regex replace with std::wstring and std::u32string. |
test_match.cpp | Performing regex match against a pattern and getting the match count and match results. Shows how to iterate over the match results to get the captured groups/substrings. |
test_match2.cpp | Contains an example to take subject string, pattern and modifier from user input and perform regex match using JPCRE2. |
testmd.cpp | Examples of working with modifier table. |
testme.cpp | Examples of using MatchEvaluator to perform replace. |
test_replace.cpp | Example of doing regex replace. |
test_replace2.cpp | Contains an example to take subject string, replacement string, modifier and pattern from user input and perform regex replace with JPCRE2. |
test_pthread.cpp | Multi threaded examples with POSIX pthread. |
teststdthread.cpp | Multi threaded examples with std::thread. |
test_shorts.cpp | Contains some short examples. |
Some test programs are written to check for major flaws like segfault, memory leak and crucial input/output validation. Before trying to run the tests, make sure you have all 3 PCRE2 libraries installed on your system.
For the simplest (minimal) test, run:
To check with valgrind
, run:
To check the multi threaded examples with drd
, run:
To prepare a coverage report, run:
The configure script generated by autotools checks for the availability of several programs and lets you set several options to control your testing environment. These are the options supported by the configure script:
Option | Details |
---|---|
--[enable/disable]-test | Enable/Disable test suite. |
--[enable/disable]-cpp11 | Enable/Disable building tests with C++11 features. |
--[enable/disable]-valgrind | Enable/Disable valgrind test (memory leak test). |
--[enable/disable]-thread-check | Enable/Disable thread check on multi threaded examples. |
--[enable/disable]-coverage | Enable/Disable coverage report. |
--[enable/disable]-silent-rules | Enable/Disable silent rules (enabled by default). You will get prettified make output if enabled. |
Please do all pull requests against the master branch. The default branch is 'release' which is not where continuous development of JPCRE2 is done.
If you find any error in the documentation or confusing/misleading use of terms, or anything that catches your eye and does not feel right, please open an issue in the issue page. Or if you want to fix it and do a pull request, then use the master branch.
This page is generated from doxy/doxydoc.md file, thus changing the README.md file will have no impact.
This project comes with a BSD LICENCE, see the LICENCE file for more details.
You are not required to tell me which project uses this library, but I would very much appreciate it if you let me know its name (and a short description if applicable). So if you have the time, please send me an email.