Regex won't separate last string

I wrote a regex that should separate a specific sequence of numbers from an HTML file, but it doesn't work for the last part. This is how the HTML file prints out:

0430\n
0500 20 40 53\n
0606 19 32 45 58\n
0711 22 33 44 55 \n
...
2000 20 40\n
2100 20 40\n
2200 20 40\n
2300 20 40\n
0000\n
\n

and this is my regex:

timeRegex = re.compile(r'''((\d\d)(\d\d) (\n|(\s (\d\d) \s? (\d\d)? \s? (\d\d)? \s? (\d\d)? \s? (\d\d)? )\n)? )''', re.VERBOSE|re.DOTALL)

Looking at the resulting list, it works fine for the most part, until the last element, where it also picks up the 0000, so it looks like this: '2300 20 40\n0000\n\n'
Please help.

Are you asking why 0000 is matched? Your \s? matches 1 or 0 whitespace characters.
– Wiktor Stribiżew
yesterday

@WiktorStribiżew I'm confused as to why '2300 20 40\n0000\n\n' is the last element in the list, and not just '0000\n'. I'm not sure what I'm doing wrong, because this doesn't happen anywhere else in the list.
– BorkoP
yesterday

Do you have literal \n in the file? I'm trying to understand why you show \n before the line breaks.
– Barmar
yesterday

Have you tried putting it into regex101.com? It shows how all the capture groups are matching with color codes.
– Barmar
yesterday

Why do you use re.DOTALL? Aren't you parsing the file line-by-line?
– David Nemeskey
yesterday

2 Answers

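When it gets to this part of the input:

2300 20 40\n
0000\n

It matches as follows:

(\d\d)(\d\d) matches 2300
\s matches the space
(\d\d) matches 20
\s? matches the space
(\d\d)? matches 40
\s? matches the newline
(\d\d)? matches 00
\s? matches nothing
(\d\d)? matches 00
\s? (\d\d)? match nothing
\n matches the final newline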

I suspect you didn't realize that \s matches any kind of whitespace, including newlines. If you want to match a literal space in a verbose regexp, write a space preceded by a backslash. So most of those \s? should be \ ? (backslash, space, question mark).
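A minimal sketch of the difference, assuming the file content is the string shown in `data` (a shortened, made-up sample in the same shape as the question's output). The names `loose` and `strict` are only illustrative, and `re.DOTALL` is left out because it only affects `.`, which the pattern never uses:

```python
import re

# Shortened sample in the same shape as the question's data (an assumption).
data = "0430\n0500 20 40 53\n2300 20 40\n0000\n\n"

# The question's pattern: \s and \s? can also consume newlines.
loose = re.compile(r'''(\d\d)(\d\d)
    (\n | (\s(\d\d) \s?(\d\d)? \s?(\d\d)? \s?(\d\d)? \s?(\d\d)?) \n)?''', re.VERBOSE)

# Same shape, but "\ " matches only a literal space, even under re.VERBOSE.
strict = re.compile(r'''(\d\d)(\d\d)
    (\n | (\ (\d\d) \ ?(\d\d)? \ ?(\d\d)? \ ?(\d\d)? \ ?(\d\d)?) \n)?''', re.VERBOSE)

loose_matches = [m.group(0) for m in loose.finditer(data)]
strict_matches = [m.group(0) for m in strict.finditer(data)]

print(loose_matches)   # last element swallows the 0000 line
print(strict_matches)  # every line is its own element
```

With the escaped space, the 2300 line and the 0000 line come out as separate elements.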

Thanks bud, this worked, I didn't know about all of that
– BorkoP
yesterday

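The reason is twofold:

1. \s matches any whitespace character, including newlines.
2. \s? is optional, so it can also match nothing at all.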

So what happens is that one of your \s? tokens eats the newline after the line 2300 20 40, and the next \s? matches the missing whitespace in the middle of 0000 (that is, it matches nothing). You don't see the problem in other places because the regex is one \s?(\d\d)? short of covering two full lines; add one more to the regex and you will see the lines

2000 20 40\n
2100 20 40\n

imploded too.
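This effect can be reproduced with a small sketch. `build_pattern` is a hypothetical helper (not from the question) that writes the same pattern without verbose mode and repeats the optional `\s?(\d\d)?` part a given number of times; the three-line `data` string is an assumed sample:

```python
import re

def build_pattern(optional_pairs):
    # Hypothetical helper: the question's pattern with a configurable
    # number of optional "\s?(\d\d)?" parts after the first minute value.
    return re.compile(
        r'(\d\d)(\d\d)(\n|(\s(\d\d)' + r'\s?(\d\d)?' * optional_pairs + r')\n)?'
    )

data = "2000 20 40\n2100 20 40\n2200 20 40\n"

# Four optional parts, as in the question: each line matches on its own.
four = [m.group(0) for m in build_pattern(4).finditer(data)]

# Five optional parts: enough to span two full three-number lines.
five = [m.group(0) for m in build_pattern(5).finditer(data)]

print(four)
print(five)
```

With four optional parts the engine cannot reach a newline after crossing into the next line, so it backtracks to the single-line match; the fifth part removes that accident.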

I am not sure how you would like to parse this file, but judging from your code, you want to do it line by line. If so, "explicit is better than implicit":

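import re

# The repetition is wrapped in one capturing group: a bare (\s\d{2})* would
# keep only its last repetition. The trailing \s* tolerates spaces before the newline.
time_regex = re.compile(r'^(\d{4})((?:\s\d{2})*)\s*$')

with open(...) as inf:
    for line in inf:
        m = time_regex.match(line)
        if m:
            # m.group(1) is the four-digit time, m.group(2).split() the two-digit values
            first, rest = m.group(1), m.group(2).split()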
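To try the line-by-line approach without a file, here is a sketch that feeds an in-memory sample through `io.StringIO`. The `parse_times` helper and the sample text are assumptions for illustration. The pattern is a variant of the one above: the repeated part sits inside a single capturing group so that `m.group(2).split()` sees every value, and `\s*` allows trailing spaces.

```python
import io
import re

# Variant of the answer's pattern (an assumption): the repetition is inside
# one capturing group so group(2) holds all the two-digit values.
time_regex = re.compile(r'^(\d{4})((?:\s\d{2})*)\s*$')

def parse_times(lines):
    # Hypothetical helper: returns (four_digit_time, [two_digit_values]) per matching line.
    parsed = []
    for line in lines:
        m = time_regex.match(line)
        if m:  # blank lines do not match and are skipped
            parsed.append((m.group(1), m.group(2).split()))
    return parsed

# Stand-in for the real file; iterating a StringIO yields lines like a file object.
sample = io.StringIO("0430\n0500 20 40 53\n0711 22 33 44 55 \n2300 20 40\n0000\n\n")
result = parse_times(sample)

for first, rest in result:
    print(first, rest)
```

Because each line is matched separately, a pattern can never run past the end of a line into the next one.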
