PHP Regex Complexity

Question

I am having difficulties with regexes when there are spaces and carriage returns between the text.

For example, in the case below, what regex would capture " <div id="contentleft"> " (the text between id="content"> and <SCRIPT, including the whitespace and line breaks)?

<div id="content"> 


<div id="contentleft">  <SCRIPT language=JavaScript>

I tried

id="content">(.*?)<SCRIPT

but that won't work.

+1

php regex

chris May 24 '09 at 7:07

6 answers

Take a look at the PCRE modifiers: http://ar2.php.net/manual/en/reference.pcre.pattern.modifiers.php

You can apply the s modifier, for example '/id="content">(.*?)<SCRIPT/s'

(Be careful not to confuse it with the m modifier, which changes how ^ and $ match instead.)

Otherwise, you can do '/id="content">((.|\n)*?)<SCRIPT/'

EDIT: oops, wrong modifier ...

+1

Tordek May 24 '09 at 7:13

Try

id="content">((?:.|\n)*?)<SCRIPT

The usual warning not to parse HTML with regex applies, but you seem to already know that.

As an alternative:

(?<=id="content">)(?:.|\n)*?(?=<SCRIPT)

By default, the period does not match newline characters. One way around this is to allow them explicitly. This works even when the "dotall" modifier is not available to you.

The first regex is your approach extended with \n. Your match will be in group 1; you only need to trim it.

The second regular expression uses zero-width assertions (look-behind / look-ahead) to mark the beginning and end of the match. The match then contains nothing you don't need, so no grouping is required.
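A minimal sketch of both variants in PHP, assuming the input from the question is held in a variable called $s (the variable names here are illustrative, not from the original answer):

```php
<?php
// Sample input modelled on the question (whitespace and newlines between the tags).
$s = "<div id=\"content\"> \n\n\n<div id=\"contentleft\">  <SCRIPT language=JavaScript>";

$fromGroup = null;
$fromLookaround = null;

// Variant 1: capture group. The wanted text is in $m[1], padded with whitespace.
if (preg_match('/id="content">((?:.|\n)*?)<SCRIPT/', $s, $m)) {
    $fromGroup = trim($m[1]);
}

// Variant 2: look-behind / look-ahead. The whole match ($m[0]) is the wanted text.
if (preg_match('/(?<=id="content">)(?:.|\n)*?(?=<SCRIPT)/', $s, $m)) {
    $fromLookaround = trim($m[0]);
}

echo $fromGroup, "\n";
echo $fromLookaround, "\n";
```

Both variants should print <div id="contentleft"> for this input. Note that the look-behind only works here because id="content"> has a fixed length.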

0

Tomalak May 24 '09 at 8:10

Another solution without regex:

$start = 'id="content">';
$end = '<SCRIPT';
if (($startPos = strpos($str, $start)) !== false &&
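    ($endPos = strpos($str, $end, $startPos + strlen($start))) !== false) {
    // Skip past the start marker so only the text between the markers is kept.
    $startPos += strlen($start);
    $substr = substr($str, $startPos, $endPos - $startPos);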
}
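The strpos approach can also be wrapped in a small helper that returns only the text between the two markers. This is a sketch, not part of the original answer: textBetween is a hypothetical name, and the nullable return type assumes PHP 7.1 or later.

```php
<?php
// Hypothetical helper: returns the text between $start and $end,
// or null if either marker is missing.
function textBetween(string $str, string $start, string $end): ?string
{
    $startPos = strpos($str, $start);
    if ($startPos === false) {
        return null;
    }
    $startPos += strlen($start);              // move past the start marker
    $endPos = strpos($str, $end, $startPos);  // search only after the start marker
    if ($endPos === false) {
        return null;
    }
    return substr($str, $startPos, $endPos - $startPos);
}

$s = "<div id=\"content\"> \n\n<div id=\"contentleft\">  <SCRIPT language=JavaScript>";
$between = textBetween($s, 'id="content">', '<SCRIPT');
echo trim($between), "\n";
```

Because no regex is involved, newlines between the markers need no special handling.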

0

Gumbo May 24 '09 at 8:33

Well, this is a multi-line issue, so take a look at the pattern modifiers:

m (PCRE_MULTILINE) By default, PCRE treats the subject string as consisting of a single "line" of characters (even if it actually contains several newlines). The "start of line" metacharacter (^) matches only at the start of the string, while the "end of line" metacharacter ($) matches only at the end of the string, or before a terminating newline (unless the D modifier is set). This is the same as Perl.

When this modifier is set, the "start of line" and "end of line" constructs match immediately after or immediately before any newline in the subject string, respectively, as well as at the very start and end. This is equivalent to Perl's /m modifier. If there are no "\n" characters in the subject string, or no occurrences of ^ or $ in the pattern, setting this modifier has no effect.

s (PCRE_DOTALL) If this modifier is set, a dot metacharacter in the pattern matches all characters, including newlines. Without it, newlines are excluded. This modifier is equivalent to Perl's /s modifier. A negative class such as [^a] always matches a newline character, regardless of the setting of this modifier.

from http://www.php.net/manual/en/reference.pcre.pattern.modifiers.php
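To make the difference between the two modifiers concrete, here is a small sketch on a two-line string (the sample string and variable names are made up for illustration):

```php
<?php
$s = "first line\nsecond line";

// No modifier: the dot stops at the newline, so the two lines cannot be bridged.
$dotPlain = preg_match('/first.*second/', $s);

// s (PCRE_DOTALL): the dot also matches "\n".
$dotAll = preg_match('/first.*second/s', $s);

// m (PCRE_MULTILINE) does not change the dot, only ^ and $.
$dotMulti = preg_match('/first.*second/m', $s);

// Without m, ^ matches only at the very start of the subject string.
$caretPlain = preg_match('/^second/', $s);
$caretMulti = preg_match('/^second/m', $s);

printf("dot: plain=%d s=%d m=%d\n", $dotPlain, $dotAll, $dotMulti);
printf("caret: plain=%d m=%d\n", $caretPlain, $caretMulti);
```

So for the question's pattern, which relies on the dot crossing line breaks, s is the modifier that helps; m would make no difference.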

0

coma May 24 '09 at 8:50

$dom = new DOMDocument();
$dom->strictErrorChecking = false;
$dom->loadHTML($html_str);

$xpath = new DOMXPath($dom);
$div = $xpath->query('//div[@id="content"]')->item(0);

Please correct my xpath expression - not sure if this will work ...

0

Jet May 24 '09 at 12:46

Schwern · Accepted Answer · 2009-05-24T09:02:45+0000

$s = '<div id="content">

<div id="contentleft">  <SCRIPT language=JavaScript>';

if( preg_match('/id="content">(.*?)<SCRIPT/s', $s, $matches) )
    print $matches[1]."\n";

Dot, by default, matches everything except newlines. /s makes it match anything.

But really, use a DOM parser. You can walk the tree, or you can use an XPath query. Think of it like regular expressions for XML.

$s = '<div id="content">

<div id="contentleft">  <SCRIPT language=JavaScript>';

// Load the HTML
$doc = new DOMDocument();
$doc->loadHTML($s);

// Use XPath to find the <div id="content"> tag descendants.
$xpath = new DOMXPath($doc);
$nodes = $xpath->query("//div[@id='content']/descendant::*");

foreach( $nodes as $node ) {
    // Stop when we see <script ...>
    if( $node->nodeName == "script" )
        break;

    // do what you want with the content
}
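As a rough, self-contained illustration of the same loop, the sketch below uses a hypothetical, well-formed variant of the markup (the $html string and the $names array are assumptions for the example) and silences parser warnings with libxml_use_internal_errors:

```php
<?php
// Hypothetical, well-formed variant of the question's markup.
$html = '<html><body><div id="content"><div id="contentleft">Hello</div>'
      . '<script>var x = 1;</script></div></body></html>';

libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// Select every element below <div id="content">, in document order.
$xpath = new DOMXPath($doc);
$nodes = $xpath->query("//div[@id='content']/descendant::*");

$names = [];
foreach ($nodes as $node) {
    // Stop when we reach the <script> element.
    if ($node->nodeName === 'script') {
        break;
    }
    $names[] = $node->nodeName . '#' . $node->getAttribute('id');
}

print_r($names);
```

With this input the loop should collect only div#contentleft before it reaches the script element. Whitespace and line breaks between the tags play no role, since the parser works on the element tree.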

XPath is extremely powerful. Here are some examples.

PS I'm sure (hopefully) that the above code can be tightened up somewhat.
