Matching a Quoted String With a Regular Expression
· Mar 31, 04:35 AM
I needed to match quoted strings in text, so I set out to write a regular expression to find them. I am, of course, not the first person to require such a thing, and a quick web search turned up a very nice analysis of the problem. I leave you to read it, and provide here some REALbasic code. The function that follows contains a version of the regular expression adapted for the REALbasic IDE.
Function QuotedStringBody(s as String) As String
  // Return the body of the first quoted string in s, or "" if there is none.
  dim r as new RegEx
  const quotedString = "\x22([^\x22\\]*((\\.)*[^\x22\\]*)*)\x22"
  r.SearchPattern = quotedString
  dim match as RegExMatch = r.Search(s)
  // Subexpression 0 is the whole match; subexpression 1 is the body without the quotes.
  if match <> nil and match.SubExpressionCount >= 2 then
    return match.SubExpressionString(1)
  else
    return ""
  end if
End Function
The regular expression language allows you to represent any literal by its ASCII code as hex (\xdd) or octal (\ndd). Here I use \x22 in the regular expression so that I do not need to escape every use of the double-quote in the IDE.
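For readers who want to try the pattern outside the IDE, here is a minimal sketch in Python rather than REALbasic. It assumes Python's re module treats \x22 and the character classes the same way, which holds for this pattern; the function name and the sample text are mine, chosen only for illustration.

```python
import re

# Same pattern as the REALbasic constant; \x22 is the double-quote character.
QUOTED_STRING = r'\x22([^\x22\\]*((\\.)*[^\x22\\]*)*)\x22'

def quoted_string_body(s: str) -> str:
    """Return the body of the first quoted string in s, or '' if there is none."""
    match = re.search(QUOTED_STRING, s)
    # Group 1 is the body without the surrounding quotes.
    return match.group(1) if match else ""

if __name__ == "__main__":
    # The escaped quotes inside the string do not end the match early.
    print(quoted_string_body(r'say "a \"b\" c" now'))  # a \"b\" c
```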
The string returned by this function still contains escaped characters. We can replace them with another regular expression.
Function QuotedStringContents(body as String) As String
  // Replace each backslash-escaped character with the character itself.
  dim r as new RegEx
  r.Options.ReplaceAllMatches = true
  const escape = "\\(.)"
  r.SearchPattern = escape
  r.ReplacementPattern = "$1"
  return r.Replace(body)
End Function
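The same unescaping step can be sketched in Python, where re.sub replaces every match by default and the replacement refers to the captured character as \1. This is an illustration of the idea, not a translation of the RegEx class; the function name is hypothetical.

```python
import re

def quoted_string_contents(body: str) -> str:
    """Replace each backslash-plus-character pair with the character alone."""
    return re.sub(r'\\(.)', r'\1', body)

if __name__ == "__main__":
    print(quoted_string_contents(r'a \"b\" c'))  # a "b" c
```

Note that an escaped backslash collapses to a single backslash, since the pair is consumed as one match.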
It would be nicer to have a regular expression that returned the quoted string body as subexpression 0. With the use of some fancier features of regular expressions, we can do just that.
First, the original regular expression contains quite a few subexpressions. Let’s begin by removing the parentheses used to group the quoted string body.
const quotedString = "\x22[^\x22\\]*((\\.)*[^\x22\\]*)*\x22"
We can use the operator ?: to tell the RegEx object that some parentheses are for grouping only, so that no subexpression need be kept.
const quotedString = "\x22[^\x22\\]*(?:(?:\\.)*[^\x22\\]*)*\x22"
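One way to see what ?: changes is to count the capture groups before and after. The sketch below uses Python, whose compiled patterns expose that count as the groups attribute; the variable names and the sample text are assumptions of mine.

```python
import re

# Original pattern: every pair of parentheses keeps a subexpression.
capturing = re.compile(r'\x22([^\x22\\]*((\\.)*[^\x22\\]*)*)\x22')
# Rewritten pattern: (?:...) groups without keeping anything.
grouping_only = re.compile(r'\x22[^\x22\\]*(?:(?:\\.)*[^\x22\\]*)*\x22')

if __name__ == "__main__":
    print(capturing.groups)      # 3
    print(grouping_only.groups)  # 0
    sample = r'x "a\"b" y'
    # Both patterns still match the same text overall.
    print(capturing.search(sample).group(0) == grouping_only.search(sample).group(0))
```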
Next, we want the regular expression to look for a pattern that follows a ", without including the " in the match. This is accomplished with the lookbehind operator ?<=.
const quotedString = "(?<=\x22)[^\x22\\]*(?:(?:\\.)*[^\x22\\]*)*\x22"
And then we want the same thing for the terminal quote; that is, we want to match a certain pattern followed by a ", without including the " in the match. For this we use the lookahead operator ?=.
const quotedString = "(?<=\x22)[^\x22\\]*(?:(?:\\.)*[^\x22\\]*)*(?=\x22)"
This does the trick; we can now simplify the function QuotedStringBody.
Function QuotedStringBody(s as String) As String
  // Return the body of the first quoted string in s, or "" if there is none.
  dim r as new RegEx
  const quotedString = "(?<=\x22)[^\x22\\]*(?:(?:\\.)*[^\x22\\]*)*(?=\x22)"
  r.SearchPattern = quotedString
  dim match as RegExMatch = r.Search(s)
  if match <> nil then
    // The quotes are outside the match, so subexpression 0 is the body itself.
    return match.SubExpressionString(0)
  else
    return ""
  end if
End Function
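The final pattern can be checked the same way in Python, which also supports fixed-width lookbehind and lookahead. This is a sketch under that assumption, with a function name and sample text of my own.

```python
import re

# The quotes are only looked at, never consumed, so group 0 is the body.
QUOTED_BODY = r'(?<=\x22)[^\x22\\]*(?:(?:\\.)*[^\x22\\]*)*(?=\x22)'

def quoted_string_body(s: str) -> str:
    """Return the body of the first quoted string in s, or '' if there is none."""
    match = re.search(QUOTED_BODY, s)
    return match.group(0) if match else ""

if __name__ == "__main__":
    print(quoted_string_body(r'say "a \"b\" c" now'))  # a \"b\" c
```

One thing worth checking before searching repeatedly with this version: because the quotes are not consumed, a later search can match the text between the closing quote of one string and the opening quote of the next. In Python, re.findall on "a" and "b" returns the text between the two strings as well as the two bodies. For a single search from the start of the text, as in the function above, this does not arise.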
Comment
Wow, thanks! This is very useful to me. Never knew such advanced operators like ?: existed.
— Thomas Tempelmann · Mar 31, 02:52 PM · #
What if the string is not an ASCII string but is encoded differently? Does using the hex values work as well?
— Nat · Apr 3, 01:25 AM · #
Using the hex values should work in any case.
— charles · Apr 10, 09:24 AM · #