Matching a Quoted String With a Regular Expression

· Mar 31, 04:35 AM

I needed to match quoted strings in text, so I set out to write a regular expression to find them. I am, of course, not the first person to require such a thing, and a quick web search turned up a very nice analysis of the problem. I leave you to read it, and provide here some REALbasic code. The function that follows contains a version of the regular expression adapted for the REALbasic IDE.

Function QuotedStringBody(s as String) As String
  // Returns the body of the first quoted string in s, escapes intact.
  dim r as new RegEx
  const quotedString = "\x22([^\x22\\]*((\\.)*[^\x22\\]*)*)\x22"
  r.SearchPattern = quotedString
  dim match as RegExMatch = r.Search(s)
  if match <> nil and match.SubExpressionCount >= 2 then
    return match.SubExpressionString(1)
  else
    return ""
  end if
End Function

The regular expression language allows you to represent any character by its code in hexadecimal (\xhh) or octal (\ddd). Here I use \x22, the code for the double quote, so that I do not need to escape every double quote inside the REALbasic string literal.
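As a quick check of the pattern outside the IDE, here is a minimal sketch in Python rather than REALbasic, on the assumption that Python's re module reads this pattern the same way (it supports \x22 and the same grouping syntax). The function name quoted_string_body and the sample text are made up for illustration.

```python
import re

# Same pattern as the REALbasic constant; \x22 stands for the double quote.
QUOTED_STRING = re.compile(r'\x22([^\x22\\]*((\\.)*[^\x22\\]*)*)\x22')

def quoted_string_body(s):
    """Return the body of the first quoted string in s, or "" if none."""
    match = QUOTED_STRING.search(s)
    return match.group(1) if match else ""

# \x22 in the pattern matches a literal double quote.
print(re.fullmatch(r'\x22', '"') is not None)  # True

sample = r'say "a \"b\" c" now'
print(quoted_string_body(sample))  # a \"b\" c
```

Group 1 holds the body without the surrounding quotes, but the backslash escapes are still in place.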

The string returned by this function still contains escaped characters. We can replace them with another regular expression.

Function QuotedStringContents(body as String) As String
  dim r as new RegEx
  r.Options.ReplaceAllMatches = true
  const escape = "\\(.)"
  r.SearchPattern = escape
  r.ReplacementPattern = "$1"
  return r.Replace(body)
End Function
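The same replacement can be sketched in Python, assuming its re.sub behaves like the RegEx object with ReplaceAllMatches turned on. The name quoted_string_contents is hypothetical.

```python
import re

def quoted_string_contents(body):
    """Replace every backslash-escaped character with the character itself."""
    # \\(.) matches a backslash followed by any one character (except a
    # newline); \1 puts back only the captured character.
    return re.sub(r'\\(.)', r'\1', body)

print(quoted_string_contents(r'a \"b\" c'))  # a "b" c
print(quoted_string_contents(r'C:\\dir'))    # C:\dir
```

Note that this only strips the backslash. A sequence such as \n becomes a plain n, not a newline, so escapes with special meaning would need separate handling.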

It would be nicer to have a regular expression that returned the quoted string body as subexpression 0. With the use of some fancier features of regular expressions, we can do just that.

First, the original regular expression contains quite a few subexpressions. Let’s begin by removing the parentheses used to group the quoted string body.

const quotedString = "\x22[^\x22\\]*((\\.)*[^\x22\\]*)*\x22"

We can use the operator ?: to tell the RegEx object that some parentheses are for grouping only, so that no subexpression need be kept.

const quotedString = "\x22[^\x22\\]*(?:(?:\\.)*[^\x22\\]*)*\x22"
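To see the effect of ?: on the number of subexpressions, here is a small Python sketch. It assumes Python's re module counts groups the way the RegEx object does, apart from not counting subexpression 0; the variable names are illustrative.

```python
import re

capturing = re.compile(r'\x22([^\x22\\]*((\\.)*[^\x22\\]*)*)\x22')
grouping_only = re.compile(r'\x22[^\x22\\]*(?:(?:\\.)*[^\x22\\]*)*\x22')

# Number of capturing groups in each pattern (the whole match is not counted).
print(capturing.groups)      # 3
print(grouping_only.groups)  # 0

# Both patterns still match the same text, quotes included.
sample = r'say "a \"b\" c" now'
print(capturing.search(sample).group(0) == grouping_only.search(sample).group(0))  # True
```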

Next, we want the regular expression to look for a pattern that follows a ", without including the " in the match. This is accomplished with the lookbehind operator ?<=.

const quotedString = "(?<=\x22)[^\x22\\]*(?:(?:\\.)*[^\x22\\]*)*\x22"

And then we want the same thing for the terminal quote; that is, we want to match a certain pattern followed by a ", without including the " in the match. For this we use the operator ?=.

const quotedString = "(?<=\x22)[^\x22\\]*(?:(?:\\.)*[^\x22\\]*)*(?=\x22)"

This does the trick; we can now simplify the function QuotedStringBody.

Function QuotedStringBody(s as String) As String
  // The whole match (subexpression 0) is now the body, without the quotes.
  dim r as new RegEx
  const quotedString = "(?<=\x22)[^\x22\\]*(?:(?:\\.)*[^\x22\\]*)*(?=\x22)"
  r.SearchPattern = quotedString
  dim match as RegExMatch = r.Search(s)
  if match <> nil then
    return match.SubExpressionString(0)
  else
    return ""
  end if
End Function
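
The simplified function can be sketched in Python as well, assuming its lookbehind and lookahead behave like the REALbasic ones. The names BODY_ONLY and quoted_string_body are hypothetical.

```python
import re

# Lookbehind and lookahead require the quotes but leave them out of the match.
BODY_ONLY = re.compile(r'(?<=\x22)[^\x22\\]*(?:(?:\\.)*[^\x22\\]*)*(?=\x22)')

def quoted_string_body(s):
    """Return the body of the first quoted string in s, or "" if none."""
    match = BODY_ONLY.search(s)
    return match.group(0) if match else ""

print(quoted_string_body(r'say "a \"b\" c" now'))  # a \"b\" c

# Caveat: the quotes are not consumed, so repeated searching also returns
# the text that lies between two quoted strings.
print(BODY_ONLY.findall('x "a" b "c" y'))  # ['a', ' b ', 'c']
```

For a single search this returns the body directly. When collecting every quoted string in a text, the capturing version that consumes the quotes is likely the safer choice.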
---

Comment

  1. Wow, thanks! This is very useful to me. Never knew such advanced operators like ?: existed.

    Thomas Tempelmann · Mar 31, 02:52 PM · #

  2. What if the string is not an ASCII string but is encoded differently? Does using the hex values work as well?

    Nat · Apr 3, 01:25 AM · #

  3. Using the hex values should work in any case.

    charles · Apr 10, 09:24 AM · #
