bash, awk and/or sed to clean up a string with special formatting - solved
Within a script I am working on, I have to clean up a string to the format I need.
Structure of every string: (zip code, street name, number, extension):
Eventually followed by
The resulting string should be
4 digits, 2 letters, the number and in case of an extension followed by x and the letter or digit of the extension
Below some examples:
I started with
echo "1019RXJavakade254" | awk '{print substr($0,1,6)}'
to get the zip code
and after that I think I should use a "print match", but I can't get it right from there.
The strings are passed individually and used in the next step of the script. Originally they come from a csv file, but the (combination of) column(s) the string is coming from is always different. The first part of the script is handling that and creates this source string. The resulting string will be placed back in a column which I can add as the last column to the original csv file
I'm aware of the problems regarding numbers after the first 6 characters and whether an extension is present. So in my opinion the workflow should be something like this: the first 6 characters should be 4 digits followed by 2 letters; if not, the total result is empty. Skip characters 7 and 8 and grab the first group of digits you encounter after character 8; that is the number, and everything after it is the extension. The extension never starts directly with a digit. Only in case of an extension is there an x in between. The extension should be stripped of anything other than alphanumeric characters.
This should cover most cases; the rest will have a delay in delivery :)
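The workflow described above could be sketched roughly as follows in plain POSIX awk. This is only an illustration of the steps as stated, not one of the posted answers: the function name clean_address is hypothetical, and accepting lower-case letters in the zip code is an assumption. Because it takes the first group of digits after character 8, a digit inside the street name would still be mistaken for the number.

```shell
# clean_address: hypothetical helper name, not from the original post.
# Prints the cleaned string, or an empty line if the zip code is malformed.
clean_address() {
    printf '%s\n' "$1" | awk '
    {
        zip = substr($0, 1, 6)
        # The first 6 characters must be 4 digits followed by 2 letters.
        if (zip !~ /^[0-9][0-9][0-9][0-9][A-Za-z][A-Za-z]$/) { print ""; next }

        rest = substr($0, 9)              # skip characters 7 and 8
        # The first group of digits after character 8 is the number.
        if (!match(rest, /[0-9]+/)) { print ""; next }
        num = substr(rest, RSTART, RLENGTH)

        # Everything after the number is the extension.
        ext = substr(rest, RSTART + RLENGTH)
        gsub(/[^A-Za-z0-9]/, "", ext)     # keep alphanumeric characters only

        # An x is inserted only when an extension is present.
        print zip num (ext != "" ? "x" ext : "")
    }'
}

clean_address "1079THEemsstraat34-II"
clean_address "1066EC1eLouwesweg6"
```

Since each string is passed individually, a function that takes a single argument fits the way the script hands the values over.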
========================================================
@kvantour Thanks for your answer. I slightly changed the code to match lower-case letters too. The result is part of a larger AppleScript which runs unattended on an Xserve here in the company.
So the code I use now is
set KixCodeSourceClean to do shell script "echo " & KixCodeSource & " | awk '/^[0-9]{4}[a-zA-Z]{2}.+[0-9]+[- ].+$/{match(substr($0,8),/[0-9]+[- ].+$/);s=substr($0,7+RSTART,RLENGTH); sub(/[- ]/,\"x\",s);print substr($0,1,6)s;next} /^[0-9]{4}[a-zA-Z]{2}.+[0-9]+[a-zA-Z].*$/{match(substr($0,8),/[0-9]+[a-zA-Z].*$/);s=substr($0,7+RSTART,RLENGTH);match(s,/[0-9]+/);print substr($0,1,6)substr(s,1,RLENGTH)\"x\"substr(s,RLENGTH+1);next} /^[0-9]{4}[a-zA-Z]{2}.+[0-9]+$/{ match(substr($0,8),/[0-9]+$/);s=substr($0,7+RSTART);print substr($0,1,6)s;next}'"
It works perfectly and is a one-liner, which I prefer in this case. I use this method a lot: jumping in and out of AppleScript and using the Unix shell to solve things faster.
3 Answers
The idea I had in mind was an exclusion principle in which we test one-possibility after another:
NNNNXXabc123efgMMM-SUF
NNNNXXabc123efgMMM SUF
NNNNXXabc123efgMMMSUF
NNNNXXabc123efgMMM
The problem, however, is that both SUF and abc123efg can be anything. As a consequence, the example "1066EC1eLouwesweg6" will match the second case.
To avoid this, I was thinking to have a look at the conditions for street names, but in the Netherlands, these can be anything:
'
So there is not even a condition on the length of the street name, except, if it is one character long, it is a letter.
So this gave me the following AWK:
{gsub(/\r/,"",$0)} # removes `\r` if any
/^[0-9][0-9][0-9][0-9][A-Z][A-Z].+[0-9]+[- ].+$/{match(substr($0,8),/[0-9]+[- ].+$/);s=substr($0,7+RSTART,RLENGTH); sub(/[- ]/,"x",s);print substr($0,1,6)s;next}
/^[0-9][0-9][0-9][0-9][A-Z][A-Z].+[0-9]+[a-zA-Z].*$/{match(substr($0,8),/[0-9]+[a-zA-Z].*$/);s=substr($0,7+RSTART,RLENGTH);match(s,/[0-9]+/);print substr($0,1,6)substr(s,1,RLENGTH)"x"substr(s,RLENGTH+1);next}
/^[0-9][0-9][0-9][0-9][A-Z][A-Z].+[0-9]+$/{ match(substr($0,8),/[0-9]+$/);s=substr($0,7+RSTART);print substr($0,1,6)s;next}
And on this input file:
1019RXJavakade254
1019PGBogortuin50
1079THEemsstraat34-II
1066EC1eLouwesweg6
1019LCKNSM-laan193
1019WZScheepstimmermanstraat74
2288EASirWinstonChurchillaan275-F126
1056HZMaartenHarpertszoonTrompstraat12-3hg
1092GRLaing'snekstraat15G
F-30700RueduLavoir1
It gave me the following output:
1019RX254
1019PG50
1079TH34xII
1066EC6
1019LC193
1019WZ74
2288EA275xF126
1056HZ12x3hg
1092GR15xG
As you notice, the last one is not matched!
However, I cannot assure you that this will work 100%.
fun fact: In Ottoland, you can travel from A to B by crossing a bridge of 10m.
There are several requirements for extracting the zip code and the extension, so the result is piped through additional sed commands here.
$ str="1066EC1eLouwesweg6"
$ sed -r 's/(^[0-9]{4}[A-Z]{2})..[^0-9]*(.*)/\1\2/' <<< "$str" | sed 's/-/x/' | sed -r '/[^x]/ s/(.*[0-9]+)([A-Z]+$)/\1x\2/'
1066EC6
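Brief explanation:
sed -r 's/(^[0-9]{4}[A-Z]{2})..[^0-9]*(.*)/\1\2/' <<< "$str"
keeps the zip code (4 digits, 2 capital letters), skips the next two characters and every non-digit that follows them, and keeps the rest of the line, i.e. the number plus any extension.
sed 's/-/x/'
replaces the first - with x.
sed -r '/[^x]/ s/(.*[0-9]+)([A-Z]+$)/\1x\2/'
inserts an x between the number and a trailing run of capital letters, as in 15G.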
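As a rough sketch, the three stages can also be folded into a single sed call with several -e expressions. This uses -E in place of -r (both GNU and BSD sed accept -E). The function name clean_with_sed and the leading guard that deletes lines not starting with a zip code are additions for illustration, not part of the answer above. Without that guard, an input such as F-30700RueduLavoir1 would pass through with its hyphen turned into an x.

```shell
# clean_with_sed: hypothetical helper name, not from the original answer.
clean_with_sed() {
    printf '%s\n' "$1" | sed -E \
        -e '/^[0-9]{4}[A-Z]{2}/!d' \
        -e 's/^([0-9]{4}[A-Z]{2})..[^0-9]*(.*)/\1\2/' \
        -e 's/-/x/' \
        -e 's/(.*[0-9]+)([A-Z]+)$/\1x\2/'
}
# Expression 1 (assumed guard): drop lines that do not start with 4 digits + 2 capitals.
# Expression 2: keep the zip code, skip two characters and the following non-digits.
# Expression 3: turn the first hyphen into x.
# Expression 4: put an x between the number and trailing capital letters.

clean_with_sed "1066EC1eLouwesweg6"
clean_with_sed "1092GRLaing'snekstraat15G"
```

Like the original pipeline, this only recognises capital letters in the zip code and in a directly attached extension.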
This (using GNU awk for the 3rd arg to match() and gensub()) will produce the expected output from the input you provided:
$ cat tst.awk
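match($1,/^([0-9]{4}[[:alpha:]]{2})(..[^0-9]+)(.*)/,a) {
    # a[1] = zip code, a[2] = street name, a[3] = number plus optional extension
    if ( ! sub(/[^[:alnum:]]/,"x",a[3]) ) {
        # no separator character found: insert x between a digit and a following letter
        a[3] = gensub(/([0-9])([[:alpha:]])/,"\\1x\\2",1,a[3])
    }
}
{
    tgt = (1 in a ? a[1] a[3] : "nothing")
    print tgt, (tgt == $NF ? "succ" : "fail")
}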
$ awk -f tst.awk file
1019RX254 succ
1019PG50 succ
1079TH34xII succ
1066EC6 succ
1019LC193 succ
1019WZ74 succ
2288EA275xF126 succ
1056HZ12x3hg succ
1092GR15xG succ
nothing succ
It will fail if a digit can appear in the street name anywhere other than the first 2 characters.
The above was run on this input file and prints succ/fail after every result based on whether or not the result matches the expected result from the last field of the input file:
$ cat file
1019RXJavakade254 -result: 1019RX254
1019PGBogortuin50 -result: 1019PG50
1079THEemsstraat34-II -result: 1079TH34xII
1066EC1eLouwesweg6 -result: 1066EC6
1019LCKNSM-laan193 -result: 1019LC193
1019WZScheepstimmermanstraat74 -result: 1019WZ74
2288EASirWinstonChurchillaan275-F126 -result: 2288EA275xF126
1056HZMaartenHarpertszoonTrompstraat12-3hg -result: 1056HZ12x3hg
1092GRLaing'snekstraat15G -result: 1092GR15xG
F-30700RueduLavoir1 -result: nothing
Are these strings variables passed to your script individually or do they come from a file, all at once (e.g. each on a separate line)?
– Tom Fenech
7 hours ago