bash, awk and or sed to clean up string with special formatting - solved




Within a script I am working on, I have to clean up a string to the format I need.



Structure of every string: (zip code, street name, number, extension):



Eventually followed by



The resulting string should be
4 digits, 2 letters, the number and, if there is an extension, an x followed by the letters or digits of the extension.



Below some examples:



I started with


echo "1019RXJavakade254" | awk '{print substr($0,1,6)}'



to get the zip code
and after that I think I should use a "print match", but I can't get it right from there.



The strings are passed individually and used in the next step of the script. Originally they come from a csv file, but the column (or combination of columns) the string comes from is different every time. The first part of the script handles that and creates this source string. The resulting string will be placed in a column that I can append as the last column of the original csv file.



I'm aware that digits after the first 6 characters and the optional extension make this tricky. So in my opinion the workflow should be something like this: the first 6 characters should be 4 digits followed by 2 letters; if not, the whole result is empty. Skip characters 7 and 8 and grab the first group of digits you encounter after character 8; that is the number, and everything after it is the extension. The extension never starts directly with a digit. Only if there is an extension is an x placed in between. The extension should be stripped of anything other than alphanumeric characters.
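The workflow described above can be sketched in portable awk as follows. This is only a minimal sketch, not one of the posted answers: the function name clean_address is hypothetical, and returning an empty result when no digits follow character 8 is an assumption, since the question does not cover that case.

```bash
#!/usr/bin/env bash
# Sketch of the workflow described above; clean_address is a hypothetical name.
clean_address() {
    printf '%s\n' "$1" | awk '
    # Characters 1-6 must be 4 digits and 2 letters, otherwise the result is empty.
    !/^[0-9][0-9][0-9][0-9][A-Za-z][A-Za-z]/ { print ""; next }
    {
        zip  = substr($0, 1, 6)
        rest = substr($0, 9)                  # skip characters 7 and 8
        # The first run of digits after character 8 is the house number.
        # Assumption: an input without any such digits yields an empty result.
        if (!match(rest, /[0-9]+/)) { print ""; next }
        num = substr(rest, RSTART, RLENGTH)
        ext = substr(rest, RSTART + RLENGTH)  # everything after the number
        gsub(/[^0-9A-Za-z]/, "", ext)         # keep alphanumeric characters only
        if (ext != "") ext = "x" ext          # x only when there is an extension
        print zip num ext
    }'
}

clean_address "1079THEemsstraat34-II"   # prints 1079TH34xII
```

Because the first run of digits after character 8 is always taken as the number, a street name containing digits beyond its first two characters would still be misread by this sketch.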



This should cover most cases; the rest will have a delay in delivery :)



========================================================



@kvantour Thanks for your answer. I slightly changed the code to match lowercase letters too. The result is part of a larger AppleScript which runs unattended on an Xserve here in the company.
So the code I use now is:


set KixCodeSourceClean to do shell script "echo " & KixCodeSource & " | awk '/^[0-9]{4}[a-zA-Z]{2}.+[0-9]+[- ].+$/{match(substr($0,8),/[0-9]+[- ].+$/);s=substr($0,7+RSTART,RLENGTH); sub(/[- ]/,\"x\",s);print substr($0,1,6)s;next} /^[0-9]{4}[a-zA-Z]{2}.+[0-9]+[a-zA-Z].*$/{match(substr($0,8),/[0-9]+[a-zA-Z].*$/);s=substr($0,7+RSTART,RLENGTH);match(s,/[0-9]+/);print substr($0,1,6)substr(s,1,RLENGTH)\"x\"substr(s,RLENGTH+1);next} /^[0-9]{4}[a-zA-Z]{2}.+[0-9]+$/{ match(substr($0,8),/[0-9]+$/);s=substr($0,7+RSTART);print substr($0,1,6)s;next}'"



It works perfectly and is a one-liner, which I prefer in this case. I use this method a lot: jumping in and out of AppleScript and using the Unix shell to solve things faster.





Are these strings variables passed to your script individually or do they come from a file, all at once (e.g. each on a separate line)?
– Tom Fenech
7 hours ago





The strings are passed individually and used in the next step.
– JB Veenstra
6 hours ago







Problems occur with names such as "1066EC1eLouwesweg6-F"
– kvantour
6 hours ago





You need to be more specific about the extension. You say it is separated "by a dash, a space or something else", but I have a feeling that this "something else" is going to be a source of problems.
– Tom Fenech
6 hours ago





@TomFenech Also don't forget his penultimate example, which shows an extension without any separator.
– kvantour
6 hours ago




3 Answers



The idea I had in mind was an exclusion principle in which we test one possibility after another:


NNNNXXabc123efgMMM-SUF


NNNNXXabc123efgMMM SUF


NNNNXXabc123efgMMMSUF


NNNNXXabc123efgMMM



The problem, however, is that SUF can be anything and abc123efg can be anything. As a consequence, the example "1066EC1eLouwesweg6" will match the second case.





To avoid this, I was thinking to have a look at the conditions for street names, but in the Netherlands, these can be anything:


'



So there is not even a condition on the length of the street name, except that a one-character street name is always a letter.
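That last condition is what lets a pattern insist on at least one street-name character before the house number. The sketch below is an illustration rather than part of the original answer (the helper name matches is hypothetical); it shows how the "number directly followed by a letter suffix" pattern treats the problematic example, depending on whether the street name may be empty.

```bash
#!/usr/bin/env bash
# Hypothetical helper: report whether STRING matches the awk regex REGEX.
matches() {   # usage: matches STRING REGEX
    printf '%s\n' "$1" | awk -v re="$2" '{ print ($0 ~ re ? "match" : "no match") }'
}

zip='^[0-9][0-9][0-9][0-9][A-Z][A-Z]'
s="1066EC1eLouwesweg6"

# Street name may be empty (.*): "1" is read as the number and
# "eLouwesweg6" as the suffix, which is wrong.
matches "$s" "$zip"'.*[0-9]+[a-zA-Z].*$'   # prints: match

# Street name needs at least one character (.+): the leading "1e"
# can no longer be taken as number plus suffix.
matches "$s" "$zip"'.+[0-9]+[a-zA-Z].*$'   # prints: no match
```

With the stricter form the string falls through to the plain "number at the end" case, which yields 6 as the house number.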



So this gave me the following AWK:


{gsub(/\r/,"",$0)} # removes carriage returns (`\r`), if any
/^[0-9][0-9][0-9][0-9][A-Z][A-Z].+[0-9]+[- ].+$/{match(substr($0,8),/[0-9]+[- ].+$/);s=substr($0,7+RSTART,RLENGTH); sub(/[- ]/,"x",s);print substr($0,1,6)s;next}
/^[0-9][0-9][0-9][0-9][A-Z][A-Z].+[0-9]+[a-zA-Z].*$/{match(substr($0,8),/[0-9]+[a-zA-Z].*$/);s=substr($0,7+RSTART,RLENGTH);match(s,/[0-9]+/);print substr($0,1,6)substr(s,1,RLENGTH)"x"substr(s,RLENGTH+1);next}
/^[0-9][0-9][0-9][0-9][A-Z][A-Z].+[0-9]+$/{ match(substr($0,8),/[0-9]+$/);s=substr($0,7+RSTART);print substr($0,1,6)s;next}



And on this input file:


1019RXJavakade254
1019PGBogortuin50
1079THEemsstraat34-II
1066EC1eLouwesweg6
1019LCKNSM-laan193
1019WZScheepstimmermanstraat74
2288EASirWinstonChurchillaan275-F126
1056HZMaartenHarpertszoonTrompstraat12-3hg
1092GRLaing'snekstraat15G
F-30700RueduLavoir1



It gave me the following output:


1019RX254
1019PG50
1079TH34xII
1066EC6
1019LC193
1019WZ74
2288EA275xF126
1056HZ12x3hg
1092GR15xG



As you can see, the last input line (F-30700RueduLavoir1) is not matched, so nothing is printed for it!



However, I cannot assure you that this will work 100%.



Fun fact: in Ottoland, you can travel from A to B by crossing a bridge of 10 m.



There are several requirements for extracting the zip code and the extension, so the result is piped through additional sed commands here.


$ str="1066EC1eLouwesweg6"
$ sed -r 's/(^[0-9]{4}[A-Z]{2})..[^0-9]*(.*)/\1\2/' <<< "$str" | sed 's/-/x/' | sed -r '/[^x]/ s/(.*[0-9]+)([A-Z]+$)/\1x\2/'
1066EC6



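Brief explanation:

sed -r 's/(^[0-9]{4}[A-Z]{2})..[^0-9]*(.*)/\1\2/' <<< "$str": keeps the zip code (4 digits, 2 uppercase letters), skips the next two characters and any non-digits after them, and keeps the rest (the number plus any extension).

sed 's/-/x/': replaces the first dash with x.

sed -r '/[^x]/ s/(.*[0-9]+)([A-Z]+$)/\1x\2/': inserts an x between the number and a trailing run of uppercase letters, so 15G becomes 15xG.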
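The three stages can also run in a single sed process. The following is a sketch under a few assumptions, not the answerer's own code: the function name clean_with_sed is hypothetical, -E is used in place of -r because both GNU and BSD sed accept it, and -n with an explicit p is added so that input not starting with a zip code prints nothing (as far as I can tell, the pipeline above would pass such a line through with its first dash replaced).

```bash
#!/usr/bin/env bash
# Hypothetical wrapper: the same three substitutions in one sed invocation.
clean_with_sed() {
    printf '%s\n' "$1" | sed -n -E '
        # Only handle lines that start with 4 digits and 2 uppercase letters.
        /^[0-9]{4}[A-Z]{2}/ {
            # Keep the zip code, drop the street name, keep number and extension.
            s/(^[0-9]{4}[A-Z]{2})..[^0-9]*(.*)/\1\2/
            # A dash separator becomes x.
            s/-/x/
            # No separator: insert x before a trailing run of uppercase letters.
            s/(.*[0-9]+)([A-Z]+$)/\1x\2/
            p
        }'
}

clean_with_sed "1066EC1eLouwesweg6"   # prints 1066EC6
```

Like the original pipeline, this only recognises uppercase zip letters, a dash separator, and uppercase extensions that follow the number directly.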



This (using GNU awk for the 3rd arg to match() and gensub()) will produce the expected output from the input you provided:


$ cat tst.awk
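match($1,/^([0-9]{4}[[:alpha:]]{2})(..[^0-9]+)(.*)/,a) {
    # a[1] = zip code, a[3] = house number plus optional extension
    if ( ! sub(/[^[:alnum:]]/,"x",a[3]) ) {
        # no separator found: insert x between the number and a letter extension
        a[3] = gensub(/([0-9])([[:alpha:]])/,"\\1x\\2",1,a[3])
    }
}
{
    tgt = (1 in a ? a[1] a[3] : "nothing")
    print tgt, (tgt == $NF ? "succ" : "fail")
}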

$ awk -f tst.awk file
1019RX254 succ
1019PG50 succ
1079TH34xII succ
1066EC6 succ
1019LC193 succ
1019WZ74 succ
2288EA275xF126 succ
1056HZ12x3hg succ
1092GR15xG succ
nothing succ



It will fail if a digit can appear in the street name anywhere other than the first 2 characters.



The above was run on this input file and prints succ/fail after every result based on whether or not the result matches the expected result from the last field of the input file:


$ cat file
1019RXJavakade254 -result: 1019RX254
1019PGBogortuin50 -result: 1019PG50
1079THEemsstraat34-II -result: 1079TH34xII
1066EC1eLouwesweg6 -result: 1066EC6
1019LCKNSM-laan193 -result: 1019LC193
1019WZScheepstimmermanstraat74 -result: 1019WZ74
2288EASirWinstonChurchillaan275-F126 -result: 2288EA275xF126
1056HZMaartenHarpertszoonTrompstraat12-3hg -result: 1056HZ12x3hg
1092GRLaing'snekstraat15G -result: 1092GR15xG
F-30700RueduLavoir1 -result: nothing





