...making Linux just a little more fun!

Regular Expressions

Deividson Okopnik [deivid.okop at gmail.com]


Thu, 17 Jul 2008 23:50:11 -0300

Quick regular expressions question: I have a string and I want to return only what's inside the quotes inside that string. For example, the string is -> "Deividson" Okopnik <-, and I want only -> "Deividson" <-. It's guaranteed that there will be only a single pair of double quotes inside the string, but the length of the string inside it is not constant.

I'm using PHP, btw.

Thanks

DeiviD




Ben Okopnik [ben at linuxgazette.net]


Thu, 17 Jul 2008 23:26:18 -0400

On Thu, Jul 17, 2008 at 11:50:11PM -0300, Deividson Okopnik wrote:

> Quick regular expressions question: I have a string and I want to
> return only what's inside the quotes inside that string. For example, the
> string is -> "Deividson" Okopnik <-, and I want only -> "Deividson"
> <-. It's guaranteed that there will be only a single pair of
> double quotes inside the string, but the length of the string inside
> it is not constant.

Given that there's only one pair of double quotes, that's pretty easy. Assuming that you're using PHP's "preg_replace" function, and that your content is in a variable called $name:

echo preg_replace('/"(.*)"/', '$1', $name);

If there was more than one set of double quotes, and you wanted to make sure that you only got the content of the first one, you'd need to use a "balanced" capture. This is one of those classic regex methods that comes up all the time, and is well worth knowing.

echo preg_replace('/"([^"]+)"/', '$1', $name);

In Perl, you can comment regular expressions by using the '/x' option. I'll do that so I can explain what's going on:

/
"		# Match the opening double quote
(		# Begin capturing the content
[^"]+	# One or more characters which are NOT double quotes
)		# End capture (content will be in $1)
"		# Closing double quote
/x;
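
A minimal PHP sketch of the same idea, for comparison. PCRE accepts the same 'x' flag, so the commented pattern carries over almost unchanged. The sample string is made up for illustration and is not from the original question.

```php
<?php
// Hypothetical sample input with two quoted spans.
$text = 'say "one" and then "two"';

// Greedy version: .* runs on to the LAST double quote in the string.
preg_match('/"(.*)"/', $text, $greedy);

// Negated character class, written with the x flag so it can carry comments.
$pattern = '/
    "        # opening double quote
    ([^"]+)  # capture one or more characters that are not double quotes
    "        # closing double quote
/x';
preg_match($pattern, $text, $first);

echo $greedy[1], "\n";  // one" and then "two
echo $first[1], "\n";   // one
```

The negated class cannot step over a closing quote, which is why it stops at the first pair.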

This is very common in processing HTML. Capturing tag content, for example, looks like this:

/<([^>]+)>/
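
As a rough illustration of that tag pattern in PHP, the sketch below collects every tag body from a short fragment. The fragment is an assumption chosen for the example; preg_match_all is the standard PHP function for gathering all matches.

```php
<?php
// Hypothetical HTML fragment, used only to exercise the pattern.
$html = '<br><br> Brought to you by <a href="http://www.biblegateway.com">BibleGateway.com</a>.';

// $tags[1] receives the text captured between each < and >.
preg_match_all('/<([^>]+)>/', $html, $tags);

print_r($tags[1]);
```

This is fine for quick extraction, but it is not an HTML parser: a '>' inside an attribute value would end the match early.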
-- 
* Ben Okopnik * Editor-in-Chief, Linux Gazette * http://LinuxGazette.NET *




Ben Okopnik [ben at linuxgazette.net]


Thu, 17 Jul 2008 23:43:56 -0400

On Thu, Jul 17, 2008 at 11:26:18PM -0400, Benjamin Okopnik wrote:

> 
> ``
> echo preg_replace('/"(.*)"/', '$1', $name);
> ''

Whoops - I just realized that I forgot to throw away the rest of the line (for some reason, I thought I was just extracting the matched part.) I always knew that doing PHP would rot my brain sooner or later. :)

echo preg_replace('/.*"(.*)".*/', '$1', $name);
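
One thing worth knowing about this approach, sketched below with made-up inputs: preg_replace hands back the subject unchanged when the pattern does not match, so a string without quotes passes straight through rather than producing an empty result.

```php
<?php
// Both inputs are illustrative; the second is a hypothetical string with no quotes.
$pattern = '/.*"(.*)".*/';

// With a quoted span, the whole line is replaced by the captured text.
$with_quotes = preg_replace($pattern, '$1', '"Deividson" Okopnik');

// Without quotes the pattern never matches, so the input comes back untouched.
$without_quotes = preg_replace($pattern, '$1', 'Deividson Okopnik');

echo $with_quotes, "\n";     // Deividson
echo $without_quotes, "\n";  // Deividson Okopnik
```

If "no quotes found" has to be detected, preg_match and its return value are probably the safer tool.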
-- 
* Ben Okopnik * Editor-in-Chief, Linux Gazette * http://LinuxGazette.NET *




Jim Jackson [jj at franjam.org.uk]


Fri, 18 Jul 2008 08:46:20 +0100 (BST)

On Thu, 17 Jul 2008, Ben Okopnik wrote:

> ``
> echo preg_replace('/"(.*)"/', '$1', $name);
> ''
>
> ``
> echo preg_replace('/"([^"]+)"/', '$1', $name);
> ''

Any reason for the use of '+' instead of '*' in the second example? It could be there is a null string enclosed in quotes, which the first one would get and the second would miss.




Ben Okopnik [ben at linuxgazette.net]


Fri, 18 Jul 2008 08:31:56 -0400

On Fri, Jul 18, 2008 at 08:46:20AM +0100, Jim Jackson wrote:

> 
> 
> 
> On Thu, 17 Jul 2008, Ben Okopnik wrote:
> 
> 
> > ``
> > echo preg_replace('/"(.*)"/', '$1', $name);
> > ''
> >
> 
> > ``
> > echo preg_replace('/"([^"]+)"/', '$1', $name);
> > ''
> 
> Any reason for the use of '+' instead of '*' in the second example? It 
> could be there is a null string enclosed in quotes, which the first one 
> would get and the second would miss.

I've been working with regexes for many years now, and have never seen a practical reason for matching a null string. Do you know of a situation in which having a null string is to be preferred over 'undef' (the result of checking $1 when no capture has occurred)?

-- 
* Ben Okopnik * Editor-in-Chief, Linux Gazette * http://LinuxGazette.NET *




Jim Jackson [jj at franjam.org.uk]


Fri, 18 Jul 2008 14:41:46 +0100 (BST)

On Fri, 18 Jul 2008, Ben Okopnik wrote:

> On Fri, Jul 18, 2008 at 08:46:20AM +0100, Jim Jackson wrote:
>>
>>
>>
>> On Thu, 17 Jul 2008, Ben Okopnik wrote:
>>
>>
>>> ``
>>> echo preg_replace('/"(.*)"/', '$1', $name);
>>> ''
>>>
>>
>>> ``
>>> echo preg_replace('/"([^"]+)"/', '$1', $name);
>>> ''
>>
>> Any reason for the use of '+' instead of '*' in the second example? It
>> could be there is a null string enclosed in quotes, which the first one
>> would get and the second would miss.
>
> I've been working with regexes for many years now, and have never seen a
> practical reason for matching a null string. Do you know of a situation
> in which having a null string is to be preferred over 'undef' (the
> result of checking $1 when no capture has occurred)?

Still doesn't answer why you use the zero the first solution, and the one or more match operator '+' in the second example?

Maybe this string is valid input...

An "" example

A zero length string indicates the input was valid, and undef would indicate the input line was not of the correct format. A zero length string is often a perfectly ok value, and is different from nothing found.
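
In PHP terms, the distinction could be sketched like this. The helper names quoted_star and quoted_plus are hypothetical and exist only for this example; they return null when nothing matched, so an empty capture and "no match" stay distinguishable.

```php
<?php
// Hypothetical helper: uses '*', so an empty pair of quotes still counts as a match.
function quoted_star(string $s): ?string {
    return preg_match('/"([^"]*)"/', $s, $m) === 1 ? $m[1] : null;
}

// Hypothetical helper: uses '+', so at least one character is required inside the quotes.
function quoted_plus(string $s): ?string {
    return preg_match('/"([^"]+)"/', $s, $m) === 1 ? $m[1] : null;
}

var_dump(quoted_star('An "" example'));   // string(0) ""
var_dump(quoted_plus('An "" example'));   // NULL
var_dump(quoted_star('No quotes here'));  // NULL
```

With '*', the empty pair is reported as an empty string; with '+', the same input looks identical to a line that has no quotes at all.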




Jim Jackson [jj at franjam.org.uk]


Fri, 18 Jul 2008 14:50:39 +0100 (BST)

On Fri, 18 Jul 2008, Jim Jackson wrote:

> On Fri, 18 Jul 2008, Ben Okopnik wrote:
>> On Fri, Jul 18, 2008 at 08:46:20AM +0100, Jim Jackson wrote:
>>> On Thu, 17 Jul 2008, Ben Okopnik wrote:
>>>
>>>> ``
>>>> echo preg_replace('/"(.*)"/', '$1', $name);
>>>> ''
>>>>
>>>
>>>> ``
>>>> echo preg_replace('/"([^"]+)"/', '$1', $name);
>>>> ''
>>>
>>> Any reason for the use of '+' instead of '*' in the second example? It
>>> could be there is a null string enclosed in quotes, which the first one
>>> would get and the second would miss.
>>
>> I've been working with regexes for many years now, and have never seen a
>> practical reason for matching a null string. Do you know of a situation
>> in which having a null string is to be preferred over 'undef' (the
>> result of checking $1 when no capture has occurred)?
>

oops, I pressed send in too much haste...

> Still doesn't answer why you use the zero the first solution, and the

That should read: "...why you use the zero or more match operator in the first solution, and the"

> one or more match operator '+' in the second example?
>
> Maybe this string is valid input...
>
>   An "" example
>
> A zero length string indicates the input was valid, and undef would
> indicate the input line was not of the correct format. A zero length string
> is often a perfectly ok value, and is different from nothing found.

just curious.




Deividson Okopnik [deivid.okop at gmail.com]


Fri, 18 Jul 2008 12:37:59 -0300

Actually, I just noticed there are 2 sets of quotes in the string (the RSS returns a link <a href="blablabla">). I'm using preg_replace('/.*"([^"]+)".*/', '$1', $verse_body), but it's returning the content of the second pair of quotes...




Ben Okopnik [ben at linuxgazette.net]


Fri, 18 Jul 2008 15:51:35 -0400

On Fri, Jul 18, 2008 at 12:37:59PM -0300, Deividson Okopnik wrote:

> Actually, I just noticed there are 2 sets of quotes in the string (the
> RSS returns a link <a href="blablabla">). I'm using
> preg_replace('/.*"([^"]+)".*/', '$1', $verse_body), but it's returning
> the content of the second pair of quotes...

Yep - since the initial '.*' is (correctly) greedy and consumes everything up to the last pair of quotes. If you always want the first pair, you could specify that in a couple of different ways in PHP:

// Method #1
preg_match('/"([^"]+)"/', $verse_body, $found);
echo $found[1];
 
// Method #2
echo preg_replace('/^[^"]+"([^"]+)".*/', '$1', $verse_body);
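
A small sketch of both methods on made-up input. The strings here are assumptions, shaped like "quoted text followed by a link". It also shows one edge case: because Method #2 begins with '^[^"]+', it needs at least one character before the first quote, and swapping that '+' for '*' covers a string that starts with the quote.

```php
<?php
// Hypothetical input: quoted text first, then a link with its own quotes.
$verse_body = 'Verse: "first" <a href="second">link</a>';

// Method #1: preg_match stops at the first match it finds.
preg_match('/"([^"]+)"/', $verse_body, $found);
$method1 = $found[1];

// Method #2: anchor at the start and skip everything that is not a quote.
$method2 = preg_replace('/^[^"]+"([^"]+)".*/', '$1', $verse_body);

// Edge case: the string begins with the quote, so '^[^"]+' cannot match
// and preg_replace returns the input unchanged.
$leading = '"first" <a href="second">link</a>';
$plus = preg_replace('/^[^"]+"([^"]+)".*/', '$1', $leading);

// Allowing zero leading characters handles that input.
$star = preg_replace('/^[^"]*"([^"]+)".*/', '$1', $leading);

echo $method1, "\n";  // first
echo $method2, "\n";  // first
echo $plus, "\n";     // "first" <a href="second">link</a>
echo $star, "\n";     // first
```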
-- 
* Ben Okopnik * Editor-in-Chief, Linux Gazette * http://LinuxGazette.NET *




Ben Okopnik [ben at linuxgazette.net]


Fri, 18 Jul 2008 16:22:21 -0400

On Fri, Jul 18, 2008 at 02:41:46PM +0100, Jim Jackson wrote:

> On Fri, 18 Jul 2008, Ben Okopnik wrote:
> 
> Still doesn't answer why you use the zero the first solution, and the 
> one or more match operator '+' in the second example?

This is like asking why I would use a cup to drink my coffee one morning and a mug the next. The answer is, there's no real reason - since it doesn't matter one way or the other. If there's any reason at all, it may well be that I didn't do the dishes the night before and that the mug happened to be clean - i.e., the reason doesn't have anything to do with the thing you're asking about.

There are plenty of situations where '*' vs. '+' would matter, of course. This just doesn't happen to be one of them.

> Maybe this string is valid input...
> 
>    An "" example
> 
> A zero length string indicates the input was valid, and undef would 
> indicate the input line was not of the correct format. 

Really? That's a new one on me. In fact, I can demonstrate that this is incorrect in both directions.

ben@Tyr:~$ perl -wle'$a=undef; $b=qq["$a"]; $b=~/"([^"]*)"/; print $1'
Use of uninitialized value in concatenation (.) or string at -e line 1.

Even though the format was indeed correct - i.e., there were two double quotes in the string - the capture returned 'undef'.

ben@Tyr:~$ perl -wle'$b=qq["""]; $b=~/"([^"]*)"/; print "-$1-"'
--

Even though the format was not correct - i.e., there were three double quotes in the string - the capture returned an empty string.

> A zero length string
> is often a perfectly ok value, and is different from nothing found.

"undef" is also often a perfectly OK value, although it is indeed different from an empty string.

Jim, I understand that you're wondering about the inconsistency in my two regexes. The inconsistency is indeed there, but - as I've explained above - in the case of the problem as originally defined by Deividson, it really makes zero difference. Your idea about "correct format", though, is a case of making way too much soup out of a single oyster.

-- 
* Ben Okopnik * Editor-in-Chief, Linux Gazette * http://LinuxGazette.NET *




Deividson Okopnik [deivid.okop at gmail.com]


Fri, 18 Jul 2008 18:06:18 -0300

huh

Weirdly enough, both ways are still returning the content of the second pair of quotes... With method 1, found[0] is the content of the last pair of quotes (including the quotes), found[1] is the same content without the quotes, and found[2] is empty.

The first quoted text has spaces - can that be a problem?

this is exactly what the server returns me, and it gets stored inside $verse_body: "I will praise you with an upright heart as I learn your righteous laws."<br><br> Brought to you by <a href="http://www.biblegateway.com">BibleGateway.com</a>. Copyright (C) NIV. All Rights Reserved.

> Yep - since the initial '.*' is (correctly) greedy and consumes
> everything up to the last pair of quotes. If you always want the first
> pair, you could specify that in a couple of different ways in PHP:
>
> ``
> // Method #1
> preg_match('/"([^"]+)"/', $verse_body, $found);
> echo $found[1];
>
> // Method #2
> echo preg_replace('/^[^"]+"([^"]+)".*/', '$1', $verse_body);
> ''




Ben Okopnik [ben at linuxgazette.net]


Fri, 18 Jul 2008 17:48:08 -0400

On Fri, Jul 18, 2008 at 06:06:18PM -0300, Deividson Okopnik wrote:

> huh
> 
> Weirdly enough, both ways are still returning the content of the
> second pair of quotes... With method 1, found[0] is the content of the
> last pair of quotes (including the quotes), found[1] is the same content
> without the quotes, and found[2] is empty.
> 
> The first quoted text has spaces - can that be a problem?
> 
> this is exactly what the server returns me, and it gets stored inside
> $verse_body:
> "I will praise you with an upright heart  as I learn your righteous
> laws."<br><br> Brought to you by <a
> href="http://www.biblegateway.com">BibleGateway.com</a>. Copyright (C)
> NIV. All Rights Reserved.

If you show us the wrong data, you're likely to get the wrong answer. :) The regex would have worked fine for the string that you initially showed us.

I've just taken a look at the site, and the line you're trying to process does not contain what you think it does. "View source" shows the following:

&ldquo;I will praise you with an upright heart as I learn your righteous laws.&rdquo;- [...]

This will, of course, not work with the regex. You'll need to do some processing first - PHP has functions for converting HTML to text - and then do the extraction. HTML can be pretty tricky that way.
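
One possible shape for that two-step approach, as a sketch only. The input fragment is an assumption modelled on the line above, and the choice of strip_tags plus html_entity_decode is one option among several. After decoding, the quotes are the curly characters U+201C and U+201D rather than the ASCII double quote, so the pattern has to name them.

```php
<?php
// Hypothetical fragment modelled on the "View source" line shown above.
$raw = '&ldquo;I will praise you with an upright heart as I learn your righteous laws.&rdquo;'
     . '<br><br> Brought to you by <a href="http://www.biblegateway.com">BibleGateway.com</a>.';

// Drop the tags first, then turn the entities into real UTF-8 characters.
$text = html_entity_decode(strip_tags($raw), ENT_QUOTES | ENT_HTML5, 'UTF-8');

// Match the text between the left and right curly quotes.
// The u flag tells PCRE to treat pattern and subject as UTF-8.
$verse = null;
if (preg_match('/\x{201C}([^\x{201D}]+)\x{201D}/u', $text, $found) === 1) {
    $verse = $found[1];
}

echo $verse, "\n";
```

Stripping the tags before matching also removes the href attribute, so its ASCII quotes can no longer be picked up by mistake.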

-- 
* Ben Okopnik * Editor-in-Chief, Linux Gazette * http://LinuxGazette.NET *




Jim Jackson [jj at franjam.org.uk]


Mon, 21 Jul 2008 08:57:43 +0100 (BST)

On Fri, 18 Jul 2008, Deividson Okopnik wrote:

> Actually, I just noticed there are 2 sets of quotes in the string (the
> RSS returns a link <a href="blablabla">). I'm using
> preg_replace('/.*"([^"]+)".*/', '$1', $verse_body), but it's returning
> the content of the second pair of quotes...

It's being greedy, as has already been said. You need to alter the regexp to something like....

  '/[^"]*"([^"]+)".*/'

i.e. match any non-" chars and find the first "


