Extracting numbers with regular expressions

Shortly after I wrote about extracting numbers from text strings, Winston Snyder at dataprose.org wrote a detailed article about using “regular expressions” to separate text from numbers in any string.  I have adapted his routine for the same purpose as in the previous post, that is to extract a single numerical value from a text string.  The regular expressions approach has two main advantages:

  • The same function can be used to extract numbers from the left, right, or middle of a text string.
  • No delineators are required.

The only drawback is that if the text string contains more than one number the function will concatenate them if they are integers, or return zero if they both have decimals.

The new ExtractNum function has been added to GetNum.xlsb, and  Text-in2.xlsb, and is shown in use in the screenshot below:

ExtractNum Function

ExtractNum Function

This function is only scratching the surface of what can be done with regular expressions. For more details and links see dataprose.org.

This entry was posted in Excel, UDFs, VBA and tagged , , , , . Bookmark the permalink.

19 Responses to Extracting numbers with regular expressions

  1. Georg says:

    Hi Doug,
    looks like it is time for creating PerXLL??? At least, imho the language Perl is the Jedi Knight of RegEx. Some people have already tried to call Perl scripts from within VBA, and to deal with the results as well:
    http://stackoverflow.com/questions/618116/how-do-i-call-a-perl-script-from-excel
    http://stackoverflow.com/questions/5176032/is-there-a-way-to-run-perl-code-from-inside-excel-and-have-it-return-a-hash
    Does not look too discouraging, does it??
    I got interested in this subject because gmpy2 (via PyXLL) does not always return “nice” strings to VBA…
    All the best
    Georg

    Like

  2. dougaj4 says:

    Hi Georg, I’m not sure that it would be productive for me to delve into yet another programming language, but if you find interesting Perl related links I will certainly be glad to see them!

    As I said, I have only scratched the surface of using RegEx in VBA (or anywhere else for that matter). Do you see significant advantages in using external sources as the RegEx engine?

    Like

    • Georg says:

      Hi Doug,
      you are absolutely right, it would be best if you could make GMPY2/PyXLL to return nice strings for sure. I use PyXLL 2.7 for education, and I am in charge of creating a brief tutorial containing worked examples of how to connect XL to xNumbers, MPMATH, GMPY2 a.s.o. until July.
      So far, I’ve never experienced problems with any MP PY lib when defining functions in PyXLL according to the following scheme:

      import gmpy2 as gm
      @xl_func(“int myPrec: string”, category=”my PyXLL funcs”)
      def MPY_SetPREC(myPrec):
         gm.get_context().precision = myPrec
         myResult = “GMPY2 precision has been set to ” + str(myPrec) + ” bits.”
         return myResult

      @xl_func(“string myX: string”, category=”my PyXLL funcs”)
      def MPY_ACOS(myX):
         return str(gm.acos(gm.mpfr(myX)))
      ————
      in XL2013 with Py 2.7.6 in default bit-prec (53) calling
      =MPY_ACOS(0.875) then gives me:
      -> 0.50536051028415729
      =MPY_SetPREC(106)
      -> GMPY2 precision has been set to 106 bits.
      =MPY_ACOS(0.875)
      -> 0.505360510284157306971314873987428

      Same with mpmath:
      from mpmath import mp
      @xl_func(“int myPrec: string”, category=”my PyXLL funcs”)
      def PY_mpPREC(myPrec):
         mp.dps = myPrec
         mp.pretty = True
         myResult = “MPMATH precision has been set to ” + str(myPrec) + ” decimal figures.”
         return myResult

      @xl_func(“string myX: string”, category=”my PyXLL funcs”)
      def PY_mpACOS(myX):
         return str(mp.acos(myX))

      =PY_mpPREC(33)
      -> Precision in mpmath has been set to 33 decimal figures.
      =PY_mpACOS(0.875)
      -> 0.505360510284157306971314873987422

      Care has to be taken when converting MP values down to doubles, because XL chops off all numerical characters after the 15th significant one, whereas VBA uses at max. 20 of them (thanks to John Beyers for this hint). The x2dbl3 function of xNumbers does is always correctly; but is seems best to do the conversion in the framework of the PY lib used (checking that routine for correctness is needed, of course).

      In the example given, I get the following hex representations:
      x2dbl3 -> 3F E0 2B E9 CE 0B 87 CD
      XL val -> 3F E0 2B E9 CE 0B 87 CA
      VB val -> 3F E0 2B E9 CE 0B 87 CD

      So where do you experience problems with ugly strings returned from PY…?

      Like

      • Georg says:

        Just an addition: if strings to be returned from PY to VBA/XL become longer than 255 characters, none of these trailing characters does it make into VBA/XL but is silently chopped off. The most straightforward solution (as mentioned in a recent post on the PyXLL forum) is to use “var” as return type for such a PyXLL UDF. I’ve verified that this works with gm.fac(200), which then returns all 375 chars.

        Like

  3. dougaj4 says:

    Hi Georg,
    I haven’t actually done much work with strings form Python yet. My main need at the moment is just to strip integers from the beginning or end of text labels that come from a finite element package I use. which is quite straightforward.

    I have had some problems with transferring arrays with mixed floats and strings to Numpy arrays, but I’ve come to the conclusion that its best to avoid doing that (and also avoid blank cells).

    One point I don’t think is right is VBA using 20 sig. figs. Both VBA and Excel worksheet calcs use 64 bits, but Excel rounds to 15 s.f. and transfers the rounded number. If you enter =PI() in a cell, it transfers the 64 bit value, but if you convert it to a value it rounds it to 15 decimal s.f.

    In VBA, if you define pi by entering the value to 16 or more s.f. it rounds the display to 15 s.f, but it still uses the 64 bit value! Also there is a Decimal data type which stores 28 s.f.; see code below:

    Function TestPi(xlPi As Double)
    Dim VBpi(1 To 8) As Double, Pidiff(1 To 9, 1 To 1) As Double, i As Long, QPi As Variant
    VBpi(1) = Atn(1) * 4
    VBpi(2) = “3.1415926535897932385”
    VBpi(3) = “3.141592653589793239”
    VBpi(4) = “3.14159265358979324”
    VBpi(5) = “3.1415926535897932”
    VBpi(6) = “3.141592653589793”
    VBpi(7) = 3.14159265358979 ‘ Entered as 3.141592653589793
    VBpi(8) = 3.14159265358979
    QPi = CDec(“3.141592653589793238462643383279”)
    For i = 1 To 8
    Pidiff(i, 1) = xlPi – VBpi(i)
    Next i
    Pidiff(9, 1) = xlPi – QPi
    TestPi = Pidiff
    End Function

    Now enter =TestPi(B1) as an array function, with =Pi() in B1, and you get:
    0
    0
    0
    0
    0
    0
    0
    3.10862E-15
    -3.23850000E-15

    But convert B1 to a number (press F2, F9) and you get:
    3.141592654
    -3.10862E-15
    -3.10862E-15
    -3.10862E-15
    -3.10862E-15
    -3.10862E-15
    -3.10862E-15
    -3.10862E-15
    0
    -3.2384626434E-15
    !

    Like

    • Lori says:

      Doug – There was a previous discussion on exactly this subject in the comments to your post of 18 months ago: https://newtonexcelbach.wordpress.com/2012/10/09/daily-download-23-trigonometric-functions-in-vba-and-fortran/. Georg suggested that vba uses 20 digits to determine which binary number to round to which does appear to be the case.

      Try in the immediate window this 20sf string:
      ?[pi()]-“3.1415926535897933381”
      This returns -4.44089209850063E-16.
      Now change the last digit “1” to “09999999999” and it returns 0.

      Like

      • dougaj4 says:

        Hi Lori,
        I thought it seemed familiar, I should have looked it up 🙂
        When I try your numbers in the immediate window I get 0 in both cases (32 bit XL 2010 and 64 bit Windows).
        But I take your point, VBA stores doubles as 64 bits, but when converting a text string to a double, it uses the first 20 figures, which is what Georg said.
        I’d be interested to know if the different behaviour with the 20 figure text for pi is related to the XL version, OS, or possibly a typo?

        Just did a bit of trial and error:
        Pi – “3.1415926535897933383” returns -4.44089209850063E-16 but
        Pi – “3.141592653589793338299999999” returns 0
        both in the immediate window and in my UDF

        Like

      • dougaj4 says:

        By the way, I didn’t know you could use [] as short for Worksheetfunction, so thanks for that.

        Like

    • Georg says:

      Hi Doug and Lori,
      I think I did not make myself clear enough. What I wanted to tell is that when parsing a numerical string to a double value, XL chops off all numerical characters beyond that corresponding to the 15th decimal significant figure, before actually converting the eventually shortened string into a numerical value of the type IEEE754 double precision. VBA uses 20 dec. sig. figs. Contrary, the “x2dbl3” function supplied by Xnumbers uses the entire string in order to reliably convert it to the nearest double value; it implements the tie-breaking rule “to even”. I have not yet thoroughly tested PY’s string parser… Please note that bugs have been discovered (and fixed) quite recently in a widely used 20 year old C parser called strtod, as is documented on the blog exploringbinary.com, for example. I have not yet tried out whether some versions of PY are be affected by that bug or not.
      I will try to compile an instructive set of string parsing examples this weekend. After it will be finished, I’ll share it here.

      The display of IEEE754 DP values in XL is a nearly completely different subject; please mind the nasty “magic parentheses” effect when subtracting two nearly identical values from each other. For example, when you sum up 10,20,30,…,100 cells containing 0.1, the cell formula =XL_sum-exact_sum will yield zero for 10,20,…,60 cells being summed up, but suddenly show -8.88E-15, -1.24E-14, -1.60E-14, and -1.95E-14 for 70,80,90, and 100 cells being summed up. Using “magic parentheses” in the cell formula, i.e. =(XL_sum-exact_sum), will force XL to show the unexpected but correct(*) results -1.11E-16, 4.44E-16, 1.33E-15, 1.78E-15, -1.78E-15, and -5.33E-15 for 10,20,…,60 cells being summed up. (*) correct here means: to be expected when using the naive summation algorithm in double precision binary floating point.
      This is a nice example showing why you should never ever set up a range in floating point by using naive summation in combination with step sizes that are not exactly representable in binary floating point. And now tighten your seat belt and have a look into the source code of widely used ODE solvers, for example…
      VBA harnesses the 11 extra bits of mantissa precision available on the x87 stack in single-line statements, which XL does not. You can see this from comparing the results of calculations like =((1.5+1/2^n)-(1.5-1/2^n)) for n = 1,2,…,64. In XL, the result will be different from zero up to n = 52 (up to n=50 without the outer “magic” parentheses), but in VBA it will differ from zero up to n=63:
      Option Explicit
      Const one# = 1#
      Const two# = 2#
      Const threehalves# = 1.5
      Function VBA_CHECK#(myPow2!)
      Dim del#
      del = one / two ^ myPow2
      VBA_CHECK = (threehalves + del) – (threehalves – del)
      End Function
      HTH Georg

      Like

      • dougaj4 says:

        Georg – I’ve got the point about the 20 figures of text now.

        I have just remembered one other aspect of transferring numbers as strings which may be important, and was covered in this post:
        https://newtonexcelbach.wordpress.com/2014/03/16/transfer-of-arrays-tofrom-python-with-pyxll-part1-data-types/
        Towards the end it makes the point that values in a numpy string array, written back to Excel directly from Python, were truncated to only 12 significant figures. I don’t recall the exact details now, but I wondered if you had encountered this?

        Like

      • dougaj4 says:

        Georg – also thanks for the final part of your comment, about the extra 7 bits available to VBA. I didn’t know that!

        Just tried the VBA_Check function using a Dec data type:
        Function VBA_CHECK4(myPow2)
        Dim del As Variant
        del = CDec(1 / 2 ^ myPow2)
        VBA_CHECK4 = (1.5 + del) – (1.5 – del)
        End Function

        This returns a non-zero result for n up to 94, although the return value has only one significant figure for n = 91 or greater, and for n = 94 the result is about double the correct value (2E-28, compared with 1.00974E-28).

        Like

  4. Lori says:

    Strange there is a discrepancy, albeit small. Two different setups (Office 2010/2013 64-Bit Windows 7/8) gave the same results in tests…
    3.1415926535897928940 <= PI < 3.1415926535897933381
    The difference at the boundaries is equal to 2^-51.

    Like

  5. Georg says:

    Hi Doug,
    if I remember correctly (IIRC), the currency/decimal tpyes are *fixed* point variable types supporting up to 28 decimal significant figures… long time ago, I played with a VBA memory hexdump routine and IIRC found out that each decimal figure is coded by one byte… I hope the next weekend will be rainy…
    Back to VBA string to double parsing: I’ve defined the following three VBA functions:
    Function VBA_Cdbl_of_Val_of_Var#(ByVal myX As Variant)
    VBA_Cdbl_of_Val_of_Var = CDbl(Val(myX))
    End Function
    Function VBA_Cdbl_of_Var#(ByVal myX As Variant)
    VBA_Cdbl_of_Var = CDbl(myX)
    End Function
    Function VBA_implicit_cast_on_input#(ByVal myX As Double)
    VBA_implicit_cast_on_input = myX
    End Function
    I managed to get my hands on the following three XL installations: XL15 on W7Prof64, XL14 on W7Prof64, and XL10 on WXP32. I used the following multi-precision string for testing:
    xN := square root of pi divided by 14195. In decimal notation, we thus have:

    DP_lo  1.248646601553727252496839...E-4
    xN     1.24864660155372738802266....E-4
    sigFig 1234567890123456789012345
    DP_tie 1.24864660155372738802211....E-4
    DP_hi  1.248646601553727523547382...E-4

    What you can see from this example is
    1. you need at least 23 dec. sigFigs in order to enable an ideal parser to find the double nearest to a multi-precision numerical streing, which in this case turns out to be DP_hi.
    2. you might need 17 dec. sigFigs in order to uniquely serialise a double precision value.
    The results concerning string parsing to double are as follows:
    A) The value function of all three XL versions from 2002 to 2013 definitely uses only the first 15 dec. sigFigs. of a string.
    B) The VBA val function has been greatly improved from XL10 to XL14. In case of XL14 and XL15 (VBA 7), it supports at least 23 dec. sigFigs.
    C) The VBA Cdbl function as well as the procedure involved when implicitly type casting a string on input have not been improved; they definitely use only the first 20 dec. sigFigs. of a string. (From results obtained using the example, you can only conclude that they do not use 23 sigFigs, but I did not want to write a novel here…)
    Thanks for notifying me of problems with NumPY-PyXLL, I’ll have a look on that…
    Lori, I’m sorry, but due Win8(.1) lacking a WinXP virtual machine, there is no chance for me to get my hands on such an XL installation. Perhaps, MS have decided to switch from double extended 80 bit x87 stack to 53 bit SSE in VBA in Win8(.), but that is just an inspired guess…

    Like

    • Georg says:

      there should appear a space between the leading “1” and following “2” in the “sigFig” row… lost in blogging… sorry…

      Like

  6. Georg says:

    Hi Doug,
    according to this thread:
    http://stackoverflow.com/questions/3264828/change-default-float-print-format
    12 decimal digits seem to be the default setting when casting a float into a string in case PY 2.x is used. Looks like you have two choices: modify the proper #define statement in the C source and recompile PY yourself, or you might modify the PY code in the output section of your FEM module, if it is open source. In PY2.7.x, you could use format(myFloat,’.16e’) instead of str(myFloat), for example, in order to return 17 significant decimal digits.
    Using this formatting, I’ve just noticed that the PY 2.7.6 float(mpString) statement does not always yield double precision values that are closest to given multi-precision strings, arrrrrrgh

    Like

    • dougaj4 says:

      Hi Georg,
      Thanks for your input on this. If you reach a resolution on these issues, I’d certainly be interested.

      Like

  7. Pingback: Extracting numbers from text and regular expressions | Newton Excel Bach, not (just) an Excel Blog

  8. Pingback: Dealing with dates 3: opening and saving csv files without data corruption | Newton Excel Bach, not (just) an Excel Blog

  9. Pingback: Importing Text Files; Unix Format | Newton Excel Bach, not (just) an Excel Blog

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.