Splitting a complex string with regular expressions

How can I use a regex to split this string:

string = "a[a=d b&c[e[100&2=34]]] e[cheese=blue and white] x[a=a b]"

      

into this array:

string.split( regexp ) =>

[ "a[a=d b&c[e[100&2=34]]]", "e[cheese=blue and white]", "x[a=a b]" ]

      

The general rule is that the string must be split on whitespace (\s), unless the whitespace is inside brackets ([]).



4 answers


If the rule is this simple, I would suggest just doing it manually. Step through each character and track the nesting level, increasing it by 1 for each [ and decreasing it by 1 for each ]. When you reach a space while the nesting level == 0, split there.



Edit: I thought I should also mention that some languages have other pattern matching features that support this kind of thing. For example, in Lua you can use "%b[]" to match balanced nested []. (Of course, Lua doesn't have a built-in split function...)
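Ruby's regex engine has a comparable extension: a named group can call itself with \g<name>, which is enough to match balanced brackets. The following is only a sketch of that idea, not part of the original answer; the names TOKEN and split_top_level are made up for illustration, and the code collects tokens with scan rather than calling split.

```ruby
# A token is a run of non-space, non-bracket characters and/or
# balanced [...] groups. The named group "br" calls itself via \g<br>
# to handle nesting.
TOKEN = /(?:[^\s\[\]]|(?<br>\[(?:[^\[\]]|\g<br>)*\]))+/

# Hypothetical helper: returns the top-level tokens of str.
def split_top_level(str)
  toks = []
  # Inside the block, Regexp.last_match(0) is the whole current match.
  str.scan(TOKEN) { toks << Regexp.last_match(0) }
  toks
end

str = "a[a=d b&c[e[100&2=34]]] e[cheese=blue and white] x[a=a b]"
p split_top_level(str)
```

This relies on a feature beyond classical regular expressions, so it does not carry over to engines without recursive subexpression calls.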



You can't; regular expressions are based on finite state machines, which have no "stack", so they cannot keep track of the nesting depth.



But maybe you can use a trick: try converting the string into a valid JSON string. Then you can use eval() to parse it into a JavaScript object.



Can you split on "(?<=\])\s(?=[a-z]\[)"? I.e., a space preceded by a ] and followed by a letter and a [? This assumes that you never have a sequence like that inside the brackets, such as "a[b=d[x=yb] g[w=vb]]".
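To make the limitation concrete, here is a small Ruby sketch of this lookaround split. The pattern is an assumed reading of the one suggested above (lowercase letters only), and SEPARATOR is a name chosen just for this example.

```ruby
# Assumed form of the suggested pattern: one whitespace character that is
# preceded by "]" and followed by a lowercase letter and "[".
SEPARATOR = /(?<=\])\s(?=[a-z]\[)/

str = "a[a=d b&c[e[100&2=34]]] e[cheese=blue and white] x[a=a b]"
parts = str.split(SEPARATOR)
p parts

# The heuristic breaks when "] letter[" also occurs inside brackets:
nested = "a[b=d[x=yb] g[w=vb]]"
broken = nested.split(SEPARATOR)
p broken
```

The first call yields the three expected tokens, while the second cuts the single nested token in two at the inner "] g[".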



Another option is a looping approach in which you deconstruct the nested brackets one level at a time; otherwise it is Hard (TM) to get a single regex to work as expected.

Here's an example in Ruby:

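str = "a[a=d b&c[e[100&2=34]]] e[cheese=blue and white] x[a=a b]"
left = str.dup
tokn = 0
toks = []
# Deconstruct: replace the innermost [...] groups with {n} placeholders
# until no brackets remain
loop do
  left.sub!(/\[[^\]\[]*\]/, "{#{tokn}}")
  break if $~.nil?
  toks[tokn] = $&
  tokn += 1
end
# With the brackets gone, every remaining space is a top-level separator
left = left.split(/\s+/)
# Reconstruct: restore the placeholders in reverse order
(toks.size - 1).downto(0) do |n|
  left.each { |part| part.sub!("{#{n}}") { toks[n] } }
end
p left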

      

The above example uses {n} placeholders, where n is an integer, during deconstruction, so input that already contains text of that form can ruin the reconstruction. It should illustrate the approach, though.
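One way to sidestep the collision, sketched here under the assumption that the input never contains a NUL character, is to build the placeholders from a character that cannot occur in the input and to reject input that does contain it. The method name split_with_placeholders and its mark parameter are made up for this illustration.

```ruby
# Hypothetical helper: same deconstruct/reconstruct idea, but with
# placeholders of the form mark + index + mark.
def split_with_placeholders(str, mark = "\u0000")
  raise ArgumentError, "input contains the placeholder mark" if str.include?(mark)

  left = str.dup
  toks = []
  # Deconstruct: swap each innermost [...] for a placeholder.
  loop do
    break unless left.sub!(/\[[^\[\]]*\]/) { |m| toks << m; "#{mark}#{toks.size - 1}#{mark}" }
  end
  parts = left.split(/\s+/)
  # Reconstruct: restore the placeholders in reverse order.
  (toks.size - 1).downto(0) do |n|
    parts.each { |part| part.sub!("#{mark}#{n}#{mark}") { toks[n] } }
  end
  parts
end

p split_with_placeholders("{1}[a] b[c]")
```

Input such as "{1}[a] b[c]", which trips up the {n} placeholders, comes back intact with this variant.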

It is easier and safer to write code that performs the split by iterating over the characters.

Example in Ruby:

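str = "a[a=d b&c[e[100&2=34]]] e[cheese=blue and white] x[a=a b]"
toks = []
level = st = en = 0
# Walk the string, tracking the bracket nesting level
str.each_char do |c|
  en += 1
  level += 1 if c == '['
  level -= 1 if c == ']'
  # A space at nesting level 0 ends the current token
  if level == 0 && c == ' '
    toks.push(str[st, en - 1 - st])
    st = en
  end
end
# Push whatever follows the last separator
toks.push(str[st, en - st]) if st != en
p toks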

      
