bash, awk and or sed to clean up string with special formatting - solved

Within a script I am working on, I have to clean up a string to the format I need.

Structure of every string: (zip code, street name, number, extension):

Optionally followed by

The resulting string should be:
4 digits, 2 letters, the number and, if there is an extension, an x followed by the letters or digits of the extension.

Below some examples:

I started with

echo "1019RXJavakade254" | awk '{print substr($0,0,6)}'

to get the zip code
and after that I think I should use a "print match", but I can't get it right from there.
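A minimal sketch of the zip code step, assuming the string arrives as a single shell argument. awk strings are indexed from 1, and how a start index of 0 is handled differs between awk implementations (some return only the first five characters), so starting at 1 is the portable choice. The function name `zip_of` is hypothetical and not part of the original script.

```shell
# zip_of is a hypothetical helper name used only for illustration.
zip_of() {
  # substr(s, 1, 6): start at character 1 and take 6 characters
  printf '%s\n' "$1" | awk '{ print substr($0, 1, 6) }'
}

zip_of "1019RXJavakade254"   # prints 1019RX
```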

The strings are passed individually and used in the next step of the script. Originally they come from a csv file, but the column (or combination of columns) the string comes from is always different. The first part of the script handles that and creates this source string. The resulting string will be placed back in a column which I can add as the last column to the original csv file.

I'm aware of the problems with digits appearing after the first 6 characters and with detecting whether an extension is present. So in my opinion the workflow should be something like this: the first 6 characters should be 4 digits and 2 letters; if not, the total result is empty. Skip characters 7 and 8 and grab the first group of digits you encounter after character 8; that is the number, and everything after it is the extension. The extension never starts directly with a digit. Only when there is an extension is there an x in between. The extension should be stripped of anything other than alphanumeric characters.

This should cover most cases; the rest will have a delay in delivery :)
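The workflow described above can be sketched in portable awk roughly as follows. This is only an illustration of those steps under the stated rules, not one of the answers; the function name `clean_address` is hypothetical, and the four-digit check is spelled out as `[0-9][0-9][0-9][0-9]` on the assumption that not every awk supports `{4}`.

```shell
# clean_address is a hypothetical name; it follows the workflow step by step.
clean_address() {
  printf '%s\n' "$1" | awk '
    # Step 1: the first six characters must be 4 digits and 2 letters,
    # otherwise the result is empty.
    $0 !~ /^[0-9][0-9][0-9][0-9][A-Za-z][A-Za-z]/ { print ""; next }
    {
      zip  = substr($0, 1, 6)
      rest = substr($0, 9)                  # Step 2: skip characters 7 and 8
      # Step 3: the first group of digits after character 8 is the number.
      if (!match(rest, /[0-9]+/)) { print ""; next }
      num = substr(rest, RSTART, RLENGTH)
      ext = substr(rest, RSTART + RLENGTH)  # Step 4: the remainder is the extension
      gsub(/[^A-Za-z0-9]/, "", ext)         # keep alphanumeric characters only
      suffix = (ext == "") ? "" : "x" ext   # x only when there is an extension
      print zip num suffix
    }'
}

clean_address "1079THEemsstraat34-II"   # prints 1079TH34xII
```

Like the answers, this sketch gives a wrong number when the street name contains a digit after character 8.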

========================================================

@kvantour Thanks for your answer. I slightly changed the code to accept lowercase letters too. The result is part of a larger AppleScript which runs unattended on an Xserve here in the company.
So the code I use now is

set KixCodeSourceClean to do shell script "echo " & KixCodeSource & " | awk '/^[0-9]{4}[a-zA-Z]{2}.+[0-9]+[- ].+$/{match(substr($0,8),/[0-9]+[- ].+$/);s=substr($0,7+RSTART,RLENGTH); sub(/[- ]/,\"x\",s);print substr($0,1,6)s;next} /^[0-9]{4}[a-zA-Z]{2}.+[0-9]+[a-zA-Z].*$/{match(substr($0,8),/[0-9]+[a-zA-Z].*$/);s=substr($0,7+RSTART,RLENGTH);match(s,/[0-9]+/);print substr($0,1,6)substr(s,1,RLENGTH)\"x\"substr(s,RLENGTH+1);next} /^[0-9]{4}[a-zA-Z]{2}.+[0-9]+$/{ match(substr($0,8),/[0-9]+$/);s=substr($0,7+RSTART);print substr($0,1,6)s;next}'"

It works perfectly and is a one-liner, which I prefer in this case. I use this method a lot: jumping in and out of AppleScript and using the Unix shell to solve things faster.

Are these strings variables passed to your script individually or do they come from a file, all at once (e.g. each on a separate line)?
– Tom Fenech
7 hours ago

The strings are passed individually and used in the next step.
– JB Veenstra
6 hours ago

Problems occur with names such as "1066EC1eLouwesweg6-F"
– kvantour
6 hours ago

You need to be more specific about the extension. You say it is separated "by a dash, a space or something else", but I have a feeling that this "something else" is going to be a source of problems.
– Tom Fenech
6 hours ago

@TomFenech Also don't forget his penultimate example, which shows an extension without any separator.
– kvantour
6 hours ago

3 Answers

The idea I had in mind was an exclusion principle in which we test one possibility after another:

NNNNXXabc123efgMMM-SUF

NNNNXXabc123efgMMM SUF

NNNNXXabc123efgMMMSUF

NNNNXXabc123efgMMM

The problem, however, is that SUF can be anything and abc123efg can be anything. As a consequence, the example "1066EC1eLouwesweg6" will match the second case.

To avoid this, I was thinking to have a look at the conditions for street names, but in the Netherlands, these can be anything:

'

So there is not even a condition on the length of the street name, except, if it is one character long, it is a letter.

So this gave me the following AWK:

{gsub(/\r/,"",$0)} # removes `\r` if any
/^[0-9][0-9][0-9][0-9][A-Z][A-Z].+[0-9]+[- ].+$/{match(substr($0,8),/[0-9]+[- ].+$/);s=substr($0,7+RSTART,RLENGTH); sub(/[- ]/,"x",s);print substr($0,1,6)s;next}
/^[0-9][0-9][0-9][0-9][A-Z][A-Z].+[0-9]+[a-zA-Z].*$/{match(substr($0,8),/[0-9]+[a-zA-Z].*$/);s=substr($0,7+RSTART,RLENGTH);match(s,/[0-9]+/);print substr($0,1,6)substr(s,1,RLENGTH)"x"substr(s,RLENGTH+1);next}
/^[0-9][0-9][0-9][0-9][A-Z][A-Z].+[0-9]+$/{ match(substr($0,8),/[0-9]+$/);s=substr($0,7+RSTART);print substr($0,1,6)s;next}

And on this input file:

1019RXJavakade254
1019PGBogortuin50
1079THEemsstraat34-II
1066EC1eLouwesweg6
1019LCKNSM-laan193
1019WZScheepstimmermanstraat74
2288EASirWinstonChurchillaan275-F126
1056HZMaartenHarpertszoonTrompstraat12-3hg
1092GRLaing'snekstraat15G
F-30700RueduLavoir1

It gave me the following output:

1019RX254
1019PG50
1079TH34xII
1066EC6
1019LC193
1019WZ74
2288EA275xF126
1056HZ12x3hg
1092GR15xG

As you notice, the last one is not matched!

However, I cannot assure you that this will work 100%.

fun fact: In Ottoland, you can travel from A to B by crossing a bridge of 10m.

Extracting the zip code and the extension involves several requirements, so the output of the first sed is piped into additional sed commands.

$ str="1066EC1eLouwesweg6"
$ sed -r 's/(^[0-9]{4}[A-Z]{2})..[^0-9]*(.*)/\1\2/' <<< "$str" | sed 's/-/x/' | sed -r '/[^x]/ s/(.*[0-9]+)([A-Z]+$)/\1x\2/'
1066EC6

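Brief explanation:

sed -r 's/(^[0-9]{4}[A-Z]{2})..[^0-9]*(.*)/\1\2/' <<< "$str"
The first group captures the zip code, `..` skips the next two characters (which may contain a digit, as in 1eLouwesweg), `[^0-9]*` skips the rest of the street name, and the second group captures the number with any extension.

sed 's/-/x/'
Replaces the first dash with x.

sed -r '/[^x]/ s/(.*[0-9]+)([A-Z]+$)/\1x\2/'
If the string ends in capital letters that directly follow a digit, inserts an x between them (15G becomes 15xG).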
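The three sed stages above can also run in a single sed process, which may be convenient when each string is passed individually. This is a sketch using the same expressions; the function name `clean_with_sed` is hypothetical, and `-E` is used instead of `-r` on the assumption that the script may run with either GNU or BSD sed.

```shell
# clean_with_sed is a hypothetical wrapper around the three expressions above.
clean_with_sed() {
  printf '%s\n' "$1" | sed -E \
    -e 's/(^[0-9]{4}[A-Z]{2})..[^0-9]*(.*)/\1\2/' \
    -e 's/-/x/' \
    -e '/[^x]/ s/(.*[0-9]+)([A-Z]+$)/\1x\2/'
}

clean_with_sed "1092GRLaing'snekstraat15G"   # prints 1092GR15xG
```

Like the original pipeline, this does not blank out strings that lack a valid zip code, so such input should be filtered beforehand.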

This (using GNU awk for the 3rd arg to match() and gensub()) will produce the expected output from the input you provided:

$ cat tst.awk
match($1,/^([0-9]{4}[[:alpha:]]{2})(..[^0-9]+)(.*)/,a) {
    if ( ! sub(/[^[:alnum:]]/,"x",a[3]) ) {
        a[3] = gensub(/([0-9])([[:alpha:]])/,"\\1x\\2",1,a[3])
    }
}
{
    tgt = (1 in a ? a[1] a[3] : "nothing")
    print tgt, (tgt == $NF ? "succ" : "fail")
}

$ awk -f tst.awk file
1019RX254 succ
1019PG50 succ
1079TH34xII succ
1066EC6 succ
1019LC193 succ
1019WZ74 succ
2288EA275xF126 succ
1056HZ12x3hg succ
1092GR15xG succ
nothing succ

It will fail if a digit can appear in the street name anywhere other than the first 2 characters.

The above was run on this input file and prints succ/fail after every result based on whether or not the result matches the expected result from the last field of the input file:

$ cat file
1019RXJavakade254 -result: 1019RX254
1019PGBogortuin50 -result: 1019PG50
1079THEemsstraat34-II -result: 1079TH34xII
1066EC1eLouwesweg6 -result: 1066EC6
1019LCKNSM-laan193 -result: 1019LC193
1019WZScheepstimmermanstraat74 -result: 1019WZ74
2288EASirWinstonChurchillaan275-F126 -result: 2288EA275xF126
1056HZMaartenHarpertszoonTrompstraat12-3hg -result: 1056HZ12x3hg
1092GRLaing'snekstraat15G -result: 1092GR15xG
F-30700RueduLavoir1 -result: nothing
