Regex won't separate the last string




I wrote a regex that should extract specific sequences of numbers from an HTML file, but it doesn't work on the last part. This is what the HTML file prints out:


0430\n
0500 20 40 53\n
0606 19 32 45 58\n
0711 22 33 44 55 \n
...
2000 20 40\n
2100 20 40\n
2200 20 40\n
2300 20 40\n
0000\n
\n



and this is my regex:


timeRegex = re.compile(r'''((\d\d)(\d\d)
(\n|(\s
(\d\d)
\s?
(\d\d)?
\s?
(\d\d)?
\s?
(\d\d)?
\s?
(\d\d)?
)\n)?
)''',re.VERBOSE|re.DOTALL)



Looking at the resulting list, it works fine for the most part, until the last element, where it also picks up the 0000, so it looks like this: '2300 20 40\n0000\n\n'
Please help out.





Are you asking why 0000 is matched? Your \s? matches 1 or 0 whitespace characters.
– Wiktor Stribiżew
yesterday







@WiktorStribiżew I'm confused as to why '2300 20 40\n0000\n\n' is the last element in the list, and not just '0000\n'. I'm not sure what I'm doing wrong, because this doesn't happen anywhere else in the list.
– BorkoP
yesterday





Do you have a literal \n in the file? I'm trying to understand why you show \n before the line breaks.
– Barmar
yesterday







Have you tried putting it into regex101.com? It shows how all the capture groups are matching with color codes.
– Barmar
yesterday





Why do you use re.DOTALL? Aren't you parsing the file line-by-line?
– David Nemeskey
yesterday






2 Answers



When it gets to this part of the input:


2300 20 40\n
0000\n



It matches as follows:


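(\d\d)(\d\d) matches 2300
\s matches the space
(\d\d) matches 20
\s? matches the space
(\d\d)? matches 40
\s? matches the newline after 40
(\d\d)? matches 00
\s? matches nothing
(\d\d)? matches 00
\s? (\d\d)? match nothing
\n matches the newline after 0000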
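To see this behaviour for yourself, here is a minimal sketch that runs the pattern from the question over the tail of the data. The `sample` string is an assumption: it treats the `\n` shown in the question as real newline characters, and it includes the blank last line. With that extra blank line present, the last `\s?` also gets a newline to consume, which would explain the second trailing `\n` in the result.

```python
import re

# The pattern from the question, unchanged.
time_regex = re.compile(r'''((\d\d)(\d\d)
(\n|(\s
(\d\d)
\s?
(\d\d)?
\s?
(\d\d)?
\s?
(\d\d)?
\s?
(\d\d)?
)\n)?
)''', re.VERBOSE | re.DOTALL)

# Assumed sample: the last lines of the data, with real newlines.
sample = "2200 20 40\n2300 20 40\n0000\n\n"

# group(0) is the full text of each match.
matches = [m.group(0) for m in time_regex.finditer(sample)]
print(matches)
# ['2200 20 40\n', '2300 20 40\n0000\n\n']
```

The 2200 line comes out on its own because the pattern runs out of optional fields before it can reach the end of the 2300 line, while 2300 followed by the short 0000 line fits entirely.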



I suspect you didn't realize that \s matches any kind of whitespace, including newlines. If you want to match a space literally in a verbose regexp, write a space preceded by a backslash. So most of those \s? should be "\ ?" (backslash, space, question mark).
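As a rough illustration of that advice, the sketch below uses an escaped space in a verbose pattern. It is not the asker's exact pattern: the name `fixed_regex`, the simplified structure (a repeated non-capturing group instead of five separate groups), and the sample data are assumptions made for the example.

```python
import re

# In re.VERBOSE mode, plain whitespace in the pattern is ignored,
# but a backslash-escaped space ("\ ") matches one literal space.
fixed_regex = re.compile(r'''
    \d\d \d\d          # four leading digits, e.g. 2300
    (?: \  \d\d )*     # zero or more " dd" fields, each after a literal space
    \ ?                # optional trailing space (as on the 0711 line)
    \n                 # the line's own newline, and only that one
''', re.VERBOSE)

# Assumed sample with real newlines, including the blank last line.
sample = "0711 22 33 44 55 \n2300 20 40\n0000\n\n"

lines = [m.group(0) for m in fixed_regex.finditer(sample)]
print(lines)
# ['0711 22 33 44 55 \n', '2300 20 40\n', '0000\n']
```

Because nothing in this pattern except the final `\n` can match a newline, a match can never run on into the next line. Writing the space as `[ ]` should work equally well, since whitespace inside a character class is also kept in verbose mode.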







Thanks bud, this worked, I didn't know about all of that
– BorkoP
yesterday



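The reason is twofold:

\s matches any whitespace character, including the newline;
\s? is optional, so it can also match nothing at all.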



So what happens is that one of your \s?s eats the newline after the line 2300 20 40, and the next \s? matches the missing whitespace in the middle of 0000 (that is, it matches nothing). You don't see the problem in other places because you have one \s?(\d\d)? too few to cover two full lines; add one more to the regex and you will see the lines


2000 20 40\n
2100 20 40\n



imploded too.
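A small sketch of that experiment, assuming the data contains real newlines. The helper `build` is hypothetical: it writes the pattern from the question on one line (without its outermost group) and lets the number of optional `\s?(\d\d)?` pairs vary.

```python
import re

def build(optional_fields):
    # Same shape as the pattern in the question: \s(\d\d) followed by
    # N copies of \s?(\d\d)?, then the closing \n.
    return re.compile(
        r'(\d\d)(\d\d)(\n|(\s(\d\d)' + r'\s?(\d\d)?' * optional_fields + r')\n)?'
    )

sample = "2000 20 40\n2100 20 40\n2200 20 40\n"

# 4 optional pairs is what the question has; 5 is "one more".
with_four = [m.group(0) for m in build(4).finditer(sample)]
with_five = [m.group(0) for m in build(5).finditer(sample)]

print(with_four)  # ['2000 20 40\n', '2100 20 40\n', '2200 20 40\n']
print(with_five)  # ['2000 20 40\n2100 20 40\n', '2200 20 40\n']
```

With four pairs the pattern is one field short of swallowing a second full line, so it backtracks to a single line. With five, two full lines fit and get merged.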



I am not sure how you would like to parse this file, but judging from your code, it is line by line. If so, "explicit is better than implicit":


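import re

# Group 2 wraps the whole repetition so it keeps every minute, not just the last one
time_regex = re.compile(r'^(\d{4})((?:\s\d{2})*)\s*$')
with open(...) as inf:
    for line in inf:
        m = time_regex.match(line)
        if m:  # blank or malformed lines do not match
            hour_minute, minutes = m.group(1), m.group(2).split()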
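Here is a self-contained sketch of the line-by-line approach that can be run without the real file. Several things in it are assumptions: `io.StringIO` stands in for the opened file, the sample lines are copied from the question, and the regex wraps the repetition in an outer group. That last change matters because Python keeps only the last repetition of a repeated capturing group, so a bare `(\s\d{2})*` would give `group(2)` just the final minute (or `None` when there are no minutes).

```python
import io
import re

# Outer group 2 captures all the " dd" fields at once; the inner
# group is non-capturing. \s*$ tolerates a trailing space and newline.
time_regex = re.compile(r'^(\d{4})((?:\s\d{2})*)\s*$')

# Stand-in for the real file (sample lines taken from the question).
inf = io.StringIO("0430\n0500 20 40 53\n0711 22 33 44 55 \n\n")

parsed = []
for line in inf:
    m = time_regex.match(line)
    if m is None:  # e.g. the blank line at the end of the data
        continue
    parsed.append((m.group(1), m.group(2).split()))

print(parsed)
# [('0430', []), ('0500', ['20', '40', '53']), ('0711', ['22', '33', '44', '55'])]
```

Since each line is matched separately, `\s` can no longer reach into the next line, whatever whitespace it accepts.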





