Regex question: matching either double or single quotes with one pattern
I have the following regex:
$p = '%<a.*\s+name="(.*)"\s*>(?:.*)</a>%im';
It matches <a, followed by zero or more characters, followed by whitespace and name="
It captures names even if the class or id precedes the name in the anchor.
What I would like to add is the possibility of matching on name=' with a single quote ('), since sooner or later someone will do it.
Obviously, I could just add a second regex written for this, but it seems inelegant.
Does anyone know how to add a single quote and just use one regex? Any other improvements or recommendations would be greatly appreciated. I can use all the regex help I can get!
Thanks a lot for reading,
function findAnchors($html) {
    $names = array();
    $p = '%<a.*\s+name="(.*)"\s*>(?:.*)</a>%im';
    // Collect capture group 1 (the name value) from every match.
    if (preg_match_all($p, $html, $matches, PREG_SET_ORDER)) {
        foreach ($matches as $m) {
            $names[] = $m[1];
        }
    }
    // Always return an array, even when nothing matched.
    return $names;
}
Try the following:
/<a(?:\s+(?!name)[^"'>]+(?:"[^"]*"|'[^']*')?)*\s+name=("[^"]*"|'[^']*')\s*>/im
Here you just need to strip the surrounding quotes:
substr($match[1], 1, -1)
But using a real parser like DOMDocument will undoubtedly be better than this regex.
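To illustrate the parser route, here is a minimal sketch using DOMDocument. The function name findAnchorNames() and the sample markup are made up for this example, and it assumes PHP's DOM extension is available.

```php
<?php
// Sketch only: findAnchorNames() is a hypothetical helper, not code from this thread.
function findAnchorNames(string $html): array
{
    $names = [];
    if (trim($html) === '') {
        return $names; // nothing to parse
    }

    $doc = new DOMDocument();
    // Real-world HTML is rarely valid, so collect libxml warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // The parser has already dealt with quoting, so both quote styles look the same here.
    foreach ($doc->getElementsByTagName('a') as $a) {
        if ($a->hasAttribute('name')) {
            $names[] = $a->getAttribute('name');
        }
    }
    return $names;
}

// Made-up sample input with both quote styles and extra attributes.
$html = '<p><a class="x" name="one">A</a> <a name=\'two\' id="t">B</a> <a href="#one">C</a></p>';
print_r(findAnchorNames($html)); // should list "one" and "two"
```

Because the parser handles quoting and attribute order, there are no surrounding quotes to strip afterwards.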
James' comment is actually a very popular but incorrect regex for string matching. It is wrong because it does not allow the string delimiter to be escaped. Given that the string delimiter is ' or ", the following regex works:
$regex = '([\'"])(.*?)(.{0,2})(?<![^\\\]\\\)(\1)';
\1 is the starting delimiter, \2 is the content (minus 2 characters), and \3 is the last 2 characters and the ending delimiter. This regex allows delimiters to be escaped, as long as the escape character is \ and the escape character itself has not been escaped. For example:
'Valid'
'Valid \' String'
'Invalid ' String'
'Invalid \\' String'
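For comparison, here is a sketch that checks the four examples above. It does not use the regex from this answer. It uses a different, commonly seen pattern that consumes either an escaped character (\\.) or any character that is neither the opening quote nor a backslash. The helper name isQuotedString() is made up for this example.

```php
<?php
// Hypothetical helper: true when $s is exactly one quoted string whose
// delimiter only appears inside the string if it is backslash-escaped.
function isQuotedString(string $s): bool
{
    // After PHP's single-quote unescaping the pattern reads:
    //   ^(["'])(?:\\.|(?!\1)[^\\])*\1\z
    $pattern = '/^(["\'])(?:\\\\.|(?!\1)[^\\\\])*\1\z/s';
    return preg_match($pattern, $s) === 1;
}

// The four examples, written as double-quoted PHP strings so that "\\" is one backslash.
$samples = array(
    "'Valid'",
    "'Valid \\' String'",
    "'Invalid ' String'",
    "'Invalid \\\\' String'",
);

foreach ($samples as $s) {
    echo $s, ' => ', isQuotedString($s) ? 'valid' : 'invalid', "\n";
}
```

The first two samples should be reported as valid and the last two as invalid: an unescaped quote, or a quote preceded by an escaped backslash, ends the string early.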
Use [] to match character sets:
$p = "%<a.*\s+name=['\"](.*)['\"]\s*>(?:.*)</a>%im";
Your current solution will not match anchors with other attributes following the name (for example <a name="foo" id="foo">).
Try:
$regex = '%<a\s+\S*\s*name=["\']([^"\']+)["\']%i';
This will extract the contents of the name attribute into the back-reference $1. \s* will also allow line breaks between attributes.
You don't need to match the rest of the a tag, because the negated character class [^"']+ stops at the first quote, much as a lazy match would.
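A short usage sketch of this pattern with preg_match_all follows. The single quotes inside the pattern are escaped so that it is a valid PHP single-quoted string, and the sample markup is made up for illustration.

```php
<?php
// Pattern from the answer above, with the inner single quotes escaped for PHP.
$regex = '%<a\s+\S*\s*name=["\']([^"\']+)["\']%i';

// Made-up sample input: both quote styles, with attributes before and after name.
$html = '<a class="c" name=\'top\' id="t">Top</a>' . "\n"
      . '<a name="end" id="e">End</a>';

preg_match_all($regex, $html, $matches);
print_r($matches[1]); // should hold "top" and "end"
```

One limitation, as far as I can tell: \S* covers at most one whitespace-free attribute before name, so an anchor such as <a class="x" id="y" name="z"> would not be matched.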
Here's a different approach:
$rgx='~<a(?:\s+(?>name()|\w+)=(?|"([^"]*)"|\'([^\']*)\'))+?\1~i';
I know this question is old, but when it resurfaced just now, I came up with another use for the "empty capturing group as checkbox" idiom from the Cookbook. The first non-capturing group handles matching all "name=value" pairs under the control of the reluctant plus (+?). If the attribute name is literally name, the empty group (()) matches nothing, and then the backreference (\1) matches nothing again, breaking out of the loop. (The backreference succeeds because the group participated in the match, even though it did not consume any characters.)
The attribute value is captured each time in group #2, overwriting whatever was captured on the previous iteration. (The branch-reset construct, (?|(...)|(...)), lets us "reuse" group #2 to capture the value inside the quotes, whichever kind of quotes they are.) Since the loop exits as soon as the name attribute turns up, the final captured value belongs to that attribute.
See demo at Ideone
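As a rough illustration of how the capture groups come out, here is a sketch that wraps the pattern in a helper. The function name anchorNamesViaBranchReset() and the sample markup are made up for this example.

```php
<?php
// Hypothetical wrapper around the branch-reset pattern from this answer.
function anchorNamesViaBranchReset(string $html): array
{
    $rgx = '~<a(?:\s+(?>name()|\w+)=(?|"([^"]*)"|\'([^\']*)\'))+?\1~i';
    $names = array();
    preg_match_all($rgx, $html, $matches, PREG_SET_ORDER);
    foreach ($matches as $m) {
        // $m[1] is the empty "checkbox" group; $m[2] is the last value captured,
        // which is the value of the name attribute that ended the loop.
        $names[] = $m[2];
    }
    return $names;
}

// Made-up sample: name after another attribute, no name at all, name first.
$html = '<a class="c" name=\'top\'>x</a> <a href="#top">y</a> <a name="end" id="e">z</a>';
print_r(anchorNamesViaBranchReset($html)); // should list "top" and "end"
```

The anchor that has only an href is skipped: group 1 never participates in that attempt, so the \1 backreference fails and the match is abandoned.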