PCRE2PARTIAL(3)						       PCRE2PARTIAL(3)

NAME
       PCRE2 - Perl-compatible regular expressions

PARTIAL MATCHING IN PCRE2

       In  normal  use	of  PCRE2,  if	the subject string that is passed to a
       matching function matches as far as it goes, but is too short to	 match
       the  entire pattern, PCRE2_ERROR_NOMATCH is returned. There are circum‐
       stances where it might be helpful to distinguish this case  from	 other
       cases in which there is no match.

       Consider, for example, an application where a human is required to type
       in data for a field with specific formatting requirements.  An  example
       might be a date in the form ddmmmyy, defined by this pattern:

	 ^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$

       If the application sees the user's keystrokes one by one, and can check
       that what has been typed so far is potentially valid,  it  is  able  to
       raise  an  error	 as  soon  as  a  mistake  is made, by beeping and not
       reflecting the character that has been typed, for example. This immedi‐
       ate  feedback is likely to be a better user interface than a check that
       is delayed until the entire string has been entered.  Partial  matching
       can  also be useful when the subject string is very long and is not all
       available at once.

       PCRE2 supports partial matching by means of the PCRE2_PARTIAL_SOFT  and
       PCRE2_PARTIAL_HARD  options,  which  can be set when calling a matching
       function.  The difference between the two options is whether or	not  a
       partial match is preferred to an alternative complete match, though the
       details differ between the two types  of	 matching  function.  If  both
       options are set, PCRE2_PARTIAL_HARD takes precedence.

       If  you	want to use partial matching with just-in-time optimized code,
       you must call pcre2_jit_compile() with one or both of these options:

	 PCRE2_JIT_PARTIAL_SOFT
	 PCRE2_JIT_PARTIAL_HARD

       PCRE2_JIT_COMPLETE should also be set if you are going to run  non-par‐
       tial  matches  on the same pattern. If the appropriate JIT mode has not
       been compiled, interpretive matching code is used.

       Setting a partial matching option  disables  two	 of  PCRE2's  standard
       optimizations. PCRE2 remembers the last literal code unit in a pattern,
       and abandons matching immediately if it is not present in  the  subject
       string.	This  optimization  cannot  be	used for a subject string that
       might match only partially. PCRE2 also knows the minimum	 length	 of  a
       matching	 string,  and  does not bother to run the matching function on
       shorter strings. This optimization is also disabled for partial	match‐
       ing.

PARTIAL MATCHING USING pcre2_match()

       A  partial  match occurs during a call to pcre2_match() when the end of
       the subject string is reached successfully, but	matching  cannot  con‐
       tinue because more characters are needed. However, at least one charac‐
       ter in the subject must have been inspected. This  character  need  not
       form part of the final matched string; lookbehind assertions and the \K
       escape sequence provide ways of inspecting characters before the	 start
       of  a matched string. The requirement for inspecting at least one char‐
       acter exists because an empty string can	 always	 be  matched;  without
       such  a	restriction  there would always be a partial match of an empty
       string at the end of the subject.

       When a partial match is returned, the first two elements in the ovector
       point to the portion of the subject that was matched, but the values in
       the rest of the ovector are undefined. The appearance of \K in the pat‐
       tern has no effect for a partial match. Consider this pattern:

	 /abc\K123/

       If it is matched against "456abc123xyz" the result is a complete match,
       and the ovector defines the matched string as "123", because \K	resets
       the  "start  of	match" point. However, if a partial match is requested
       and the subject string is "456abc12", a partial match is found for  the
       string  "abc12",	 because  all these characters are needed for a subse‐
       quent re-match with additional characters.

       What happens when a partial match is identified depends on which of the
       two partial matching options is set.

   PCRE2_PARTIAL_SOFT WITH pcre2_match()

       If  PCRE2_PARTIAL_SOFT  is  set when pcre2_match() identifies a partial
       match, the partial match is remembered, but matching continues as  nor‐
       mal,  and  other	 alternatives in the pattern are tried. If no complete
       match  can  be  found,  PCRE2_ERROR_PARTIAL  is	returned  instead   of
       PCRE2_ERROR_NOMATCH.

       This  option  is "soft" because it prefers a complete match over a par‐
       tial match.  All the various matching items in a pattern behave	as  if
       the  subject string is potentially complete. For example, \z, \Z, and $
       match at the end of the subject, as normal, and for \b and \B  the  end
       of the subject is treated as a non-alphanumeric.

       If  there  is more than one partial match, the first one that was found
       provides the data that is returned. Consider this pattern:

	 /123\w+X|dogY/

       If this is matched against the subject string "abc123dog", both	alter‐
       natives	fail  to  match,  but the end of the subject is reached during
       matching, so PCRE2_ERROR_PARTIAL is returned. The offsets are set to  3
       and  9, identifying "123dog" as the first partial match that was found.
       (In this example, there are two partial matches, because "dog"  on  its
       own partially matches the second alternative.)

   PCRE2_PARTIAL_HARD WITH pcre2_match()

       If  PCRE2_PARTIAL_HARD is set for pcre2_match(), PCRE2_ERROR_PARTIAL is
       returned as soon as a partial match is  found,  without	continuing  to
       search  for possible complete matches. This option is "hard" because it
       prefers an earlier partial match over a later complete match. For  this
       reason,	the  assumption	 is  made that the end of the supplied subject
       string may not be the true end of the available data, and  so,  if  \z,
       \Z,  \b, \B, or $ are encountered at the end of the subject, the result
       is PCRE2_ERROR_PARTIAL, provided that at least  one  character  in  the
       subject has been inspected.

   Comparing hard and soft partial matching

       The  difference	between the two partial matching options can be illus‐
       trated by a pattern such as:

	 /dog(sbody)?/

       This matches either "dog" or "dogsbody", greedily (that is, it  prefers
       the  longer  string  if	possible). If it is matched against the string
       "dog" with PCRE2_PARTIAL_SOFT, it yields a complete  match  for	"dog".
       However,	 if  PCRE2_PARTIAL_HARD is set, the result is PCRE2_ERROR_PAR‐
       TIAL. On the other hand, if the pattern is made ungreedy the result  is
       different:

	 /dog(sbody)??/

       In  this	 case  the  result  is always a complete match because that is
       found first, and matching never	continues  after  finding  a  complete
       match. It might be easier to follow this explanation by thinking of the
       two patterns like this:

	 /dog(sbody)?/	  is the same as  /dogsbody|dog/
	 /dog(sbody)??/	  is the same as  /dog|dogsbody/

       The second pattern will never match "dogsbody", because it will	always
       find the shorter match first.

PARTIAL MATCHING USING pcre2_dfa_match()

       The DFA functions move along the subject string character by character,
       without backtracking, searching for  all	 possible  matches  simultane‐
       ously.  If the end of the subject is reached before the end of the pat‐
       tern, there is the possibility of a partial match, again provided  that
       at least one character has been inspected.

       When PCRE2_PARTIAL_SOFT is set, PCRE2_ERROR_PARTIAL is returned only if
       there have been no complete matches. Otherwise,	the  complete  matches
       are  returned.	However, if PCRE2_PARTIAL_HARD is set, a partial match
       takes precedence over any complete matches. The portion of  the	string
       that was matched when the longest partial match was found is set as the
       first matching string.

       Because the DFA functions always search for all possible	 matches,  and
       there  is  no  difference between greedy and ungreedy repetition, their
       behaviour is different from  the	 standard  functions  when  PCRE2_PAR‐
       TIAL_HARD  is  set.  Consider  the  string  "dog"  matched  against the
       ungreedy pattern shown above:

	 /dog(sbody)??/

       Whereas the standard function stops as soon as it  finds	 the  complete
       match  for  "dog",  the	DFA  function also finds the partial match for
       "dogsbody", and so returns that when PCRE2_PARTIAL_HARD is set.

PARTIAL MATCHING AND WORD BOUNDARIES

       If a pattern ends with one of the sequences \b or \B, which test for word
       boundaries,  partial matching with PCRE2_PARTIAL_SOFT can give counter-
       intuitive results. Consider this pattern:

	 /\bcat\b/

       This matches "cat", provided there is a word boundary at either end. If
       the subject string is "the cat", the comparison of the final "t" with a
       following character cannot take place, so a  partial  match  is	found.
       However,	 normal	 matching carries on, and \b matches at the end of the
       subject when the last character is a letter, so	a  complete  match  is
       found.	The  result,  therefore,  is  not  PCRE2_ERROR_PARTIAL.	 Using
       PCRE2_PARTIAL_HARD in this case does yield PCRE2_ERROR_PARTIAL, because
       then the partial match takes precedence.

EXAMPLE OF PARTIAL MATCHING USING PCRE2TEST

       If  the	partial_soft  (or  ps) modifier is present on a pcre2test data
       line, the PCRE2_PARTIAL_SOFT option is used for the match.  Here	 is  a
       run of pcre2test that uses the date example quoted above:

	   re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
	 data> 25jun04\=ps
	  0: 25jun04
	  1: jun
	 data> 25dec3\=ps
	 Partial match: 25dec3
	 data> 3ju\=ps
	 Partial match: 3ju
	 data> 3juj\=ps
	 No match
	 data> j\=ps
	 No match

       The  first  data	 string	 is matched completely, so pcre2test shows the
       matched substrings. The remaining four strings do not  match  the  com‐
       plete pattern, but the first two are partial matches. Similar output is
       obtained if DFA matching is used.

       If the partial_hard (or ph) modifier is present	on  a  pcre2test  data
       line, the PCRE2_PARTIAL_HARD option is set for the match.

MULTI-SEGMENT MATCHING WITH pcre2_dfa_match()

       When  a	partial match has been found using a DFA matching function, it
       is possible to continue the match by providing additional subject  data
       and  calling  the function again with the same compiled regular expres‐
       sion, this time setting the PCRE2_DFA_RESTART option. You must pass the
       same working space as before, because this is where details of the pre‐
       vious partial match are stored. Here is an example using pcre2test:

	   re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
	 data> 23ja\=dfa,ps
	 Partial match: 23ja
	 data> n05\=dfa,dfa_restart
	  0: n05

       The first call has "23ja" as the subject, and requests  partial	match‐
       ing;  the  second  call	has  "n05"  as	the  subject for the continued
       (restarted) match.  Notice that when the match is  complete,  only  the
       last  part  is  shown;  PCRE2 does not retain the previously partially-
       matched string. It is up to the calling program to do that if it	 needs
       to.

       That means that, for an unanchored pattern, if a continued match fails,
       it is not possible to try again at  a  new  starting  point.  All  this
       facility	 is  capable  of  doing	 is continuing with the previous match
       attempt. In the previous example, if the second set of data  is	"ug23"
       the  result is no match, even though there would be a match for "aug23"
       if the entire string were given at once. Depending on the  application,
       this may or may not be what you want.  The only way to allow for start‐
       ing again at the next character is to retain the matched	 part  of  the
       subject and try a new complete match.

       You  can	 set the PCRE2_PARTIAL_SOFT or PCRE2_PARTIAL_HARD options with
       PCRE2_DFA_RESTART to continue partial matching over multiple  segments.
       This  facility can be used to pass very long subject strings to the DFA
       matching functions.

MULTI-SEGMENT MATCHING WITH pcre2_match()

       Unlike the DFA function, it is not possible  to	restart	 the  previous
       match with a new segment of data when using pcre2_match(). Instead, new
       data must be added to the previous subject string, and the entire match
       re-run,	starting from the point where the partial match occurred. Ear‐
       lier data can be discarded.

       It is best to use PCRE2_PARTIAL_HARD in this situation, because it does
       not  treat the end of a segment as the end of the subject when matching
       \z, \Z, \b, \B, and $. Consider	an  unanchored	pattern	 that  matches
       dates:

	   re> /\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d/
	 data> The date is 23ja\=ph
	 Partial match: 23ja

       At  this stage, an application could discard the text preceding "23ja",
       add on text from the next  segment,  and	 call  the  matching  function
       again.  Unlike  the  DFA	 matching function, the entire matching string
       must always be available, and the complete matching process occurs  for
       each call, so more memory and more processing time are needed.

ISSUES WITH MULTI-SEGMENT MATCHING

       Certain types of pattern may give problems with multi-segment matching,
       whichever matching function is used.

       1. If the pattern contains a test for the beginning of a line, you need
       to pass the PCRE2_NOTBOL option when the subject string for any call
       does not start at the beginning of a line. There is also a PCRE2_NOTEOL
       option, but in practice when doing multi-segment matching you should be
       using PCRE2_PARTIAL_HARD, which includes the effect of PCRE2_NOTEOL.

       2. If a pattern contains a lookbehind assertion, characters  that  pre‐
       cede  the start of the partial match may have been inspected during the
       matching process.  When using pcre2_match(), sufficient characters must
       be  retained  for  the  next  match attempt. You can ensure that enough
       characters are retained by doing the following:

       Before doing any matching, find the length of the longest lookbehind in
       the     pattern	  by	calling	   pcre2_pattern_info()	   with	   the
       PCRE2_INFO_MAXLOOKBEHIND option. Note that the resulting	 count	is  in
       characters, not code units. After a partial match, moving back from the
       ovector[0] offset in the subject by the number of characters given  for
       the  maximum lookbehind gets you to the earliest character that must be
       retained. In a non-UTF or a 32-bit situation, moving  back  is  just  a
       subtraction,  but in UTF-8 or UTF-16 you have to count characters while
       moving back through the code units.

       Characters before the point you have now reached can be discarded,  and
       after  the  next segment has been added to what is retained, you should
       run the next match with the startoffset argument set so that the	 match
       begins at the same point as before.
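
       The steps above can be sketched as follows for an 8-bit buffer that
       holds valid UTF-8. Both function names are hypothetical, and
       max_lookbehind is assumed to have been obtained beforehand from
       pcre2_pattern_info() with PCRE2_INFO_MAXLOOKBEHIND. In a non-UTF
       situation the step back is a plain subtraction, so utf8_step_back()
       is not appropriate there.

```c
#include <stddef.h>
#include <string.h>

/* Step back `nchars` characters from byte offset `pos` in a UTF-8 buffer,
   stopping at the start of the buffer. Continuation bytes have the bit
   pattern 10xxxxxx and are skipped as part of the character they belong to. */
static size_t utf8_step_back(const unsigned char *buf, size_t pos,
                             size_t nchars)
{
    while (nchars > 0 && pos > 0) {
        pos--;
        while (pos > 0 && (buf[pos] & 0xC0) == 0x80)
            pos--;
        nchars--;
    }
    return pos;
}

/* Discard everything before the earliest character that must be retained.
   `match_start` is ovector[0] from the partial match. The buffer contents
   are moved down, *len is updated, and the return value is the startoffset
   to use for the next match once the next segment has been appended. */
static size_t retain_for_next_match(unsigned char *buf, size_t *len,
                                    size_t match_start, size_t max_lookbehind)
{
    size_t keep_from = utf8_step_back(buf, match_start, max_lookbehind);
    memmove(buf, buf + keep_from, *len - keep_from);
    *len -= keep_from;
    return match_start - keep_from;
}
```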

       For  example, if the pattern "(?<=123)abc" is partially matched against
       the string "xx123ab", the ovector offsets are 5 and 7 ("ab"). The maxi‐
       mum  lookbehind	count  is  3, so all characters before offset 2 can be
       discarded. The value of startoffset for the next	 match	should	be  3.
       When  pcre2test	displays  a partial match, it indicates the lookbehind
       characters with '<' characters:

	   re> "(?<=123)abc"
	 data> xx123ab\=ph
	 Partial match: 123ab
			<<<

       3. Because a partial match must always contain at least one  character,
       what  might  be	considered a partial match of an empty string actually
       gives a "no match" result. For example:

	   re> /c(?<=abc)x/
	 data> ab\=ps
	 No match

       If the next segment begins "cx", a match should be found, but this will
       only  happen  if characters from the previous segment are retained. For
       this reason, a "no match" result	 should	 be  interpreted  as  "partial
       match of an empty string" when the pattern contains lookbehinds.

       4.  Matching  a subject string that is split into multiple segments may
       not always produce exactly the same result as matching over one	single
       long  string,  especially  when PCRE2_PARTIAL_SOFT is used. The section
       "Partial Matching and Word Boundaries" above describes  an  issue  that
       arises  if  the	pattern ends with \b or \B. Another kind of difference
       may occur when there are multiple matching possibilities, because  (for
       PCRE2_PARTIAL_SOFT) a partial match result is given only when there are
       no completed matches. This means that as soon as the shortest match has
       been  found,  continuation to a new subject segment is no longer possi‐
       ble. Consider this pcre2test example:

	   re> /dog(sbody)?/
	 data> dogsb\=ps
	  0: dog
	 data> do\=ps,dfa
	 Partial match: do
	 data> gsb\=ps,dfa,dfa_restart
	  0: g
	 data> dogsbody\=dfa
	  0: dogsbody
	  1: dog

       The first data line passes the string "dogsb" to	 a  standard  matching
       function, setting the PCRE2_PARTIAL_SOFT option. Although the string is
       a partial match for "dogsbody", the result is not  PCRE2_ERROR_PARTIAL,
       because	the  shorter string "dog" is a complete match. Similarly, when
       the subject is presented to a DFA matching function  in	several	 parts
       ("do"  and  "gsb"  being	 the first two) the match stops when "dog" has
       been found, and it is not possible to continue.	On the other hand,  if
       "dogsbody"  is  presented  as  a single string, a DFA matching function
       finds both matches.

       Because of these problems, it is best to	 use  PCRE2_PARTIAL_HARD  when
       matching	 multi-segment	data.  The  example above then behaves differ‐
       ently:

	   re> /dog(sbody)?/
	 data> dogsb\=ph
	 Partial match: dogsb
	 data> do\=ph,dfa
	 Partial match: do
	 data> gsb\=ph,dfa,dfa_restart
	 Partial match: gsb

       5. Patterns that contain alternatives at the top level which do not all
       start  with  the	 same  pattern	item  may  not	work  as expected when
       PCRE2_DFA_RESTART is used. For example, consider this pattern:

	 1234|3789

       If the first part of the subject is "ABC123", a partial	match  of  the
       first  alternative  is found at offset 3. There is no partial match for
       the second alternative, because such a match does not start at the same
       point  in  the  subject	string. Attempting to continue with the string
       "7890" does not yield a match  because  only  those  alternatives  that
       match  at  one  point in the subject are remembered. The problem arises
       because the start of the second alternative matches  within  the	 first
       alternative.  There  is	no  problem with anchored patterns or patterns
       such as:

	 1234|ABCD

       where no string can be a partial match for both alternatives.  This  is
       not  a  problem	if  a  standard matching function is used, because the
       entire match has to be rerun each time:

	   re> /1234|3789/
	 data> ABC123\=ph
	 Partial match: 123
	 data> 1237890
	  0: 3789

       Of course, instead of using PCRE2_DFA_RESTART, the  same	 technique  of
       re-running  the	entire	match  can  also be used with the DFA matching
       function. Another possibility is to work with two buffers. If a partial
       match  at  offset  n in the first buffer is followed by "no match" when
       PCRE2_DFA_RESTART is used on the second buffer, you can then try a  new
       match starting at offset n+1 in the first buffer.

AUTHOR

       Philip Hazel
       University Computing Service
       Cambridge, England.

REVISION

       Last updated: 22 December 2014
       Copyright (c) 1997-2014 University of Cambridge.

PCRE2 10.00		       22 December 2014		       PCRE2PARTIAL(3)