Extracting numbers with regular expressions

Posted on 12/04/2014 by dougaj4

Shortly after I wrote about extracting numbers from text strings, Winston Snyder at dataprose.org wrote a detailed article about using “regular expressions” to separate text from numbers in any string. I have adapted his routine for the same purpose as in the previous post, that is to extract a single numerical value from a text string. The regular expressions approach has two main advantages:

The same function can be used to extract numbers from the left, right, or middle of a text string.
No delineators are required.

The only drawback is that if the text string contains more than one number the function will concatenate them if they are integers, or return zero if they both have decimals.

The new ExtractNum function has been added to GetNum.xlsb, and Text-in2.xlsb, and is shown in use in the screenshot below:

ExtractNum Function

This function is only scratching the surface of what can be done with regular expressions. For more details and links see dataprose.org.

This entry was posted in Excel, UDFs, VBA and tagged Excel, extracting numbers from text, regular expressions, UDF, VBA. Bookmark the permalink.

19 Responses to Extracting numbers with regular expressions

Georg says:

13/04/2014 at 8:59 am

Hi Doug,
looks like it is time for creating PerXLL??? At least, imho the language Perl is the Jedi Knight of RegEx. Some people have already tried to call Perl scripts from within VBA, and to deal with the results as well:
http://stackoverflow.com/questions/618116/how-do-i-call-a-perl-script-from-excel
http://stackoverflow.com/questions/5176032/is-there-a-way-to-run-perl-code-from-inside-excel-and-have-it-return-a-hash
Does not look too discouraging, does it??
I got interested in this subject because gmpy2 (via PyXLL) does not always return “nice” strings to VBA…
All the best
Georg

LikeLike

Reply
dougaj4 says:

13/04/2014 at 9:27 am

Hi Georg, I’m not sure that it would be productive for me to delve into yet another programming language, but if you find interesting Perl related links I will certainly be glad to see them!

As I said, I have only scratched the surface of using RegEx in VBA (or anywhere else for that matter). Do you see significant advantages in using external sources as the RegEx engine?

LikeLike

Reply
- Georg says:
  
  04/05/2014 at 10:03 pm
  
  Hi Doug,
  you are absolutely right, it would be best if you could make GMPY2/PyXLL to return nice strings for sure. I use PyXLL 2.7 for education, and I am in charge of creating a brief tutorial containing worked examples of how to connect XL to xNumbers, MPMATH, GMPY2 a.s.o. until July.
  So far, I’ve never experienced problems with any MP PY lib when defining functions in PyXLL according to the following scheme:
  
  import gmpy2 as gm
  @xl_func(“int myPrec: string”, category=”my PyXLL funcs”)
  def MPY_SetPREC(myPrec):
     gm.get_context().precision = myPrec
     myResult = “GMPY2 precision has been set to ” + str(myPrec) + ” bits.”
     return myResult
  
  @xl_func(“string myX: string”, category=”my PyXLL funcs”)
  def MPY_ACOS(myX):
     return str(gm.acos(gm.mpfr(myX)))
  ————
  in XL2013 with Py 2.7.6 in default bit-prec (53) calling
  =MPY_ACOS(0.875) then gives me:
  -> 0.50536051028415729
  =MPY_SetPREC(106)
  -> GMPY2 precision has been set to 106 bits.
  =MPY_ACOS(0.875)
  -> 0.505360510284157306971314873987428
  
  Same with mpmath:
  from mpmath import mp
  @xl_func(“int myPrec: string”, category=”my PyXLL funcs”)
  def PY_mpPREC(myPrec):
     mp.dps = myPrec
     mp.pretty = True
     myResult = “MPMATH precision has been set to ” + str(myPrec) + ” decimal figures.”
     return myResult
  
  @xl_func(“string myX: string”, category=”my PyXLL funcs”)
  def PY_mpACOS(myX):
     return str(mp.acos(myX))
  
  =PY_mpPREC(33)
  -> Precision in mpmath has been set to 33 decimal figures.
  =PY_mpACOS(0.875)
  -> 0.505360510284157306971314873987422
  
  Care has to be taken when converting MP values down to doubles, because XL chops off all numerical characters after the 15th significant one, whereas VBA uses at max. 20 of them (thanks to John Beyers for this hint). The x2dbl3 function of xNumbers does is always correctly; but is seems best to do the conversion in the framework of the PY lib used (checking that routine for correctness is needed, of course).
  
  In the example given, I get the following hex representations:
  x2dbl3 -> 3F E0 2B E9 CE 0B 87 CD
  XL val -> 3F E0 2B E9 CE 0B 87 CA
  VB val -> 3F E0 2B E9 CE 0B 87 CD
  
  So where do you experience problems with ugly strings returned from PY…?
  
  LikeLike
  
  Reply
  - Georg says:
    
    06/05/2014 at 11:15 pm
    
    Just an addition: if strings to be returned from PY to VBA/XL become longer than 255 characters, none of these trailing characters does it make into VBA/XL but is silently chopped off. The most straightforward solution (as mentioned in a recent post on the PyXLL forum) is to use “var” as return type for such a PyXLL UDF. I’ve verified that this works with gm.fac(200), which then returns all 375 chars.
    
    LikeLike
    
    Reply
dougaj4 says:

08/05/2014 at 9:25 pm

Hi Georg,
I haven’t actually done much work with strings form Python yet. My main need at the moment is just to strip integers from the beginning or end of text labels that come from a finite element package I use. which is quite straightforward.

I have had some problems with transferring arrays with mixed floats and strings to Numpy arrays, but I’ve come to the conclusion that its best to avoid doing that (and also avoid blank cells).

One point I don’t think is right is VBA using 20 sig. figs. Both VBA and Excel worksheet calcs use 64 bits, but Excel rounds to 15 s.f. and transfers the rounded number. If you enter =PI() in a cell, it transfers the 64 bit value, but if you convert it to a value it rounds it to 15 decimal s.f.

In VBA, if you define pi by entering the value to 16 or more s.f. it rounds the display to 15 s.f, but it still uses the 64 bit value! Also there is a Decimal data type which stores 28 s.f.; see code below:

Function TestPi(xlPi As Double)
Dim VBpi(1 To 8) As Double, Pidiff(1 To 9, 1 To 1) As Double, i As Long, QPi As Variant
VBpi(1) = Atn(1) * 4
VBpi(2) = “3.1415926535897932385”
VBpi(3) = “3.141592653589793239”
VBpi(4) = “3.14159265358979324”
VBpi(5) = “3.1415926535897932”
VBpi(6) = “3.141592653589793”
VBpi(7) = 3.14159265358979 ‘ Entered as 3.141592653589793
VBpi(8) = 3.14159265358979
QPi = CDec(“3.141592653589793238462643383279”)
For i = 1 To 8
Pidiff(i, 1) = xlPi – VBpi(i)
Next i
Pidiff(9, 1) = xlPi – QPi
TestPi = Pidiff
End Function

Now enter =TestPi(B1) as an array function, with =Pi() in B1, and you get:
0
0
0
0
0
0
0
3.10862E-15
-3.23850000E-15

But convert B1 to a number (press F2, F9) and you get:
3.141592654
-3.10862E-15
-3.10862E-15
-3.10862E-15
-3.10862E-15
-3.10862E-15
-3.10862E-15
-3.10862E-15
0
-3.2384626434E-15
!

LikeLike

Reply
- Lori says:
  
  10/05/2014 at 8:20 am
  
  Doug – There was a previous discussion on exactly this subject in the comments to your post of 18 months ago: https://newtonexcelbach.wordpress.com/2012/10/09/daily-download-23-trigonometric-functions-in-vba-and-fortran/. Georg suggested that vba uses 20 digits to determine which binary number to round to which does appear to be the case.
  
  Try in the immediate window this 20sf string:
  ?[pi()]-“3.1415926535897933381”
  This returns -4.44089209850063E-16.
  Now change the last digit “1” to “09999999999” and it returns 0.
  
  LikeLike
  
  Reply
  - dougaj4 says:
    
    10/05/2014 at 12:47 pm
    
    Hi Lori,
    I thought it seemed familiar, I should have looked it up 🙂
    When I try your numbers in the immediate window I get 0 in both cases (32 bit XL 2010 and 64 bit Windows).
    But I take your point, VBA stores doubles as 64 bits, but when converting a text string to a double, it uses the first 20 figures, which is what Georg said.
    I’d be interested to know if the different behaviour with the 20 figure text for pi is related to the XL version, OS, or possibly a typo?
    
    Just did a bit of trial and error:
    Pi – “3.1415926535897933383” returns -4.44089209850063E-16 but
    Pi – “3.141592653589793338299999999” returns 0
    both in the immediate window and in my UDF
    
    LikeLike
    
    Reply
  - dougaj4 says:
    
    10/05/2014 at 12:51 pm
    
    By the way, I didn’t know you could use [] as short for Worksheetfunction, so thanks for that.
    
    LikeLike
    
    Reply
- Georg says:
  
  10/05/2014 at 10:00 pm
  
  Hi Doug and Lori,
  I think I did not make myself clear enough. What I wanted to tell is that when parsing a numerical string to a double value, XL chops off all numerical characters beyond that corresponding to the 15th decimal significant figure, before actually converting the eventually shortened string into a numerical value of the type IEEE754 double precision. VBA uses 20 dec. sig. figs. Contrary, the “x2dbl3” function supplied by Xnumbers uses the entire string in order to reliably convert it to the nearest double value; it implements the tie-breaking rule “to even”. I have not yet thoroughly tested PY’s string parser… Please note that bugs have been discovered (and fixed) quite recently in a widely used 20 year old C parser called strtod, as is documented on the blog exploringbinary.com, for example. I have not yet tried out whether some versions of PY are be affected by that bug or not.
  I will try to compile an instructive set of string parsing examples this weekend. After it will be finished, I’ll share it here.
  
  The display of IEEE754 DP values in XL is a nearly completely different subject; please mind the nasty “magic parentheses” effect when subtracting two nearly identical values from each other. For example, when you sum up 10,20,30,…,100 cells containing 0.1, the cell formula =XL_sum-exact_sum will yield zero for 10,20,…,60 cells being summed up, but suddenly show -8.88E-15, -1.24E-14, -1.60E-14, and -1.95E-14 for 70,80,90, and 100 cells being summed up. Using “magic parentheses” in the cell formula, i.e. =(XL_sum-exact_sum), will force XL to show the unexpected but correct(*) results -1.11E-16, 4.44E-16, 1.33E-15, 1.78E-15, -1.78E-15, and -5.33E-15 for 10,20,…,60 cells being summed up. (*) correct here means: to be expected when using the naive summation algorithm in double precision binary floating point.
  This is a nice example showing why you should never ever set up a range in floating point by using naive summation in combination with step sizes that are not exactly representable in binary floating point. And now tighten your seat belt and have a look into the source code of widely used ODE solvers, for example…
  VBA harnesses the 11 extra bits of mantissa precision available on the x87 stack in single-line statements, which XL does not. You can see this from comparing the results of calculations like =((1.5+1/2^n)-(1.5-1/2^n)) for n = 1,2,…,64. In XL, the result will be different from zero up to n = 52 (up to n=50 without the outer “magic” parentheses), but in VBA it will differ from zero up to n=63:
  Option Explicit
  Const one# = 1#
  Const two# = 2#
  Const threehalves# = 1.5
  Function VBA_CHECK#(myPow2!)
  Dim del#
  del = one / two ^ myPow2
  VBA_CHECK = (threehalves + del) – (threehalves – del)
  End Function
  HTH Georg
  
  LikeLike
  
  Reply
  - dougaj4 says:
    
    11/05/2014 at 11:21 am
    
    Georg – I’ve got the point about the 20 figures of text now.
    
    I have just remembered one other aspect of transferring numbers as strings which may be important, and was covered in this post:
    https://newtonexcelbach.wordpress.com/2014/03/16/transfer-of-arrays-tofrom-python-with-pyxll-part1-data-types/
    Towards the end it makes the point that values in a numpy string array, written back to Excel directly from Python, were truncated to only 12 significant figures. I don’t recall the exact details now, but I wondered if you had encountered this?
    
    LikeLike
    
    Reply
  - dougaj4 says:
    
    11/05/2014 at 12:30 pm
    
    Georg – also thanks for the final part of your comment, about the extra 7 bits available to VBA. I didn’t know that!
    
    Just tried the VBA_Check function using a Dec data type:
    Function VBA_CHECK4(myPow2)
    Dim del As Variant
    del = CDec(1 / 2 ^ myPow2)
    VBA_CHECK4 = (1.5 + del) – (1.5 – del)
    End Function
    
    This returns a non-zero result for n up to 94, although the return value has only one significant figure for n = 91 or greater, and for n = 94 the result is about double the correct value (2E-28, compared with 1.00974E-28).
    
    LikeLike
    
    Reply
Lori says:

10/05/2014 at 9:27 pm

Strange there is a discrepancy, albeit small. Two different setups (Office 2010/2013 64-Bit Windows 7/8) gave the same results in tests…
3.1415926535897928940 <= PI < 3.1415926535897933381
The difference at the boundaries is equal to 2^-51.

LikeLike

Reply
Georg says:

13/05/2014 at 5:35 am

Hi Doug,
if I remember correctly (IIRC), the currency/decimal tpyes are *fixed* point variable types supporting up to 28 decimal significant figures… long time ago, I played with a VBA memory hexdump routine and IIRC found out that each decimal figure is coded by one byte… I hope the next weekend will be rainy…
Back to VBA string to double parsing: I’ve defined the following three VBA functions:
Function VBA_Cdbl_of_Val_of_Var#(ByVal myX As Variant)
VBA_Cdbl_of_Val_of_Var = CDbl(Val(myX))
End Function
Function VBA_Cdbl_of_Var#(ByVal myX As Variant)
VBA_Cdbl_of_Var = CDbl(myX)
End Function
Function VBA_implicit_cast_on_input#(ByVal myX As Double)
VBA_implicit_cast_on_input = myX
End Function
I managed to get my hands on the following three XL installations: XL15 on W7Prof64, XL14 on W7Prof64, and XL10 on WXP32. I used the following multi-precision string for testing:
xN := square root of pi divided by 14195. In decimal notation, we thus have:
DP_lo 1.248646601553727252496839...E-4 xN 1.24864660155372738802266....E-4 sigFig 1234567890123456789012345 DP_tie 1.24864660155372738802211....E-4 DP_hi 1.248646601553727523547382...E-4
What you can see from this example is
1. you need at least 23 dec. sigFigs in order to enable an ideal parser to find the double nearest to a multi-precision numerical streing, which in this case turns out to be DP_hi.
2. you might need 17 dec. sigFigs in order to uniquely serialise a double precision value.
The results concerning string parsing to double are as follows:
A) The value function of all three XL versions from 2002 to 2013 definitely uses only the first 15 dec. sigFigs. of a string.
B) The VBA val function has been greatly improved from XL10 to XL14. In case of XL14 and XL15 (VBA 7), it supports at least 23 dec. sigFigs.
C) The VBA Cdbl function as well as the procedure involved when implicitly type casting a string on input have not been improved; they definitely use only the first 20 dec. sigFigs. of a string. (From results obtained using the example, you can only conclude that they do not use 23 sigFigs, but I did not want to write a novel here…)
Thanks for notifying me of problems with NumPY-PyXLL, I’ll have a look on that…
Lori, I’m sorry, but due Win8(.1) lacking a WinXP virtual machine, there is no chance for me to get my hands on such an XL installation. Perhaps, MS have decided to switch from double extended 80 bit x87 stack to 53 bit SSE in VBA in Win8(.), but that is just an inspired guess…

LikeLike

Reply
- Georg says:
  
  13/05/2014 at 5:49 am
  
  there should appear a space between the leading “1” and following “2” in the “sigFig” row… lost in blogging… sorry…
  
  LikeLike
  
  Reply
Georg says:

14/05/2014 at 6:17 pm

Hi Doug,
according to this thread:
http://stackoverflow.com/questions/3264828/change-default-float-print-format
12 decimal digits seem to be the default setting when casting a float into a string in case PY 2.x is used. Looks like you have two choices: modify the proper #define statement in the C source and recompile PY yourself, or you might modify the PY code in the output section of your FEM module, if it is open source. In PY2.7.x, you could use format(myFloat,’.16e’) instead of str(myFloat), for example, in order to return 17 significant decimal digits.
Using this formatting, I’ve just noticed that the PY 2.7.6 float(mpString) statement does not always yield double precision values that are closest to given multi-precision strings, arrrrrrgh

LikeLike

Reply
- dougaj4 says:
  
  24/05/2014 at 10:01 am
  
  Hi Georg,
  Thanks for your input on this. If you reach a resolution on these issues, I’d certainly be interested.
  
  LikeLike
  
  Reply
Pingback: Extracting numbers from text and regular expressions | Newton Excel Bach, not (just) an Excel Blog
Pingback: Dealing with dates 3: opening and saving csv files without data corruption | Newton Excel Bach, not (just) an Excel Blog
Pingback: Importing Text Files; Unix Format | Newton Excel Bach, not (just) an Excel Blog

	Monte Carlo simulati… on py_UMom 1-05
	Scipy-statistics wit… on Transferring different data ty…
	Scipy-statistics wit… on Scipy-statistics update
	dougaj4 on Spherical Geometry and Vincent…
	Pat Kirol on Spherical Geometry and Vincent…
	… and then py_… on py_xlCBA 0.04
	Alternative iterativ… on Newton-Raphson and Brent…
	Alternative iterativ… on Scipy functions with Excel and…
	Z on Downloads
	py_xlCBA – Sup… on py_xlCBA update
	dougaj4 on Downloads
	Z on Downloads
	py_xlCBA update \| Ne… on Calling PyCBA from Excel
	Z on Reinforced concrete elastic an…
	dougaj4 on Reinforced concrete elastic an…