@Quazz @Tom-Elliott I have done some testing of my own now and this is definitely Unicode hell! Short story: I think we should not rely on those label checks anymore, as they can go wrong so easily with non-ASCII characters. Here you go with a bit of Unicode fun in the client shell:
[Mon Dec 03 root@fogclient ~]# e=$(echo -ne '\xC3\xA9')
[Mon Dec 03 root@fogclient ~]# E=$(echo -ne '\xC3\x89')
[Mon Dec 03 root@fogclient ~]# label=$(echo -ne 'R\xC3\xA9serv\xC3\xA9_au_syst')
[Mon Dec 03 root@fogclient ~]# echo $e
é
[Mon Dec 03 root@fogclient ~]# echo $E
É
[Mon Dec 03 root@fogclient ~]# echo $label
Réservé_au_syst
OK, that’s for starters, just to get the right characters into variables, as I can’t seem to type them on my keyboard in an SSH session on a client (nor in the VM terminal). So I suppose bash and the underlying libs are able to display Unicode characters, but they don’t seem to fully support them.
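To see which bytes actually end up in those variables, here is a minimal sketch. It uses octal escapes instead of `\x` escapes only because they work with more `printf` implementations; `hex_bytes` and `byte_count` are helper names made up for this example and are not part of the FOG scripts.

```shell
#!/bin/sh
# Octal escapes are used for portability: \303\251 is the same as \xC3\xA9.
e=$(printf '\303\251')                            # é
label=$(printf 'R\303\251serv\303\251_au_syst')   # Réservé_au_syst

# Hypothetical helper: hex dump of a string's bytes, e.g. "c3 a9".
hex_bytes() {
    printf '%s' "$1" | od -An -tx1 | tr -s ' \n' '  ' | sed 's/^ //; s/ $//'
}

# Hypothetical helper: number of bytes in a string, regardless of locale.
byte_count() {
    printf '%s' "$1" | wc -c | tr -d ' '
}

echo "e:     $(hex_bytes "$e") ($(byte_count "$e") bytes)"
echo "label: $(byte_count "$label") bytes for 15 visible characters"
```

The single visible character é occupies two bytes, so the label is two bytes longer than it looks on screen.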
Important: This is using the UTF-8 bytes for `é`, but there are other encoding standards, like ISO-8859-1 through ISO-8859-15 and many more, that may encode the very same character with different bytes (in ISO-8859-1, for example, `é` is the single byte 0xE9 rather than 0xC3 0xA9). Or let me put it the other way round: if we read that label, the returned string might use a different encoding than the one we used in the scripts, so although the characters look identical to our eyes, they would still not match. So here comes the fun part:
[Mon Dec 03 root@fogclient ~]# if [[ $label =~ [Rr][Ee$E$e] ]]; then echo "JA"; fi
JA
So using the variables in the bash regex actually does work. But…
[Mon Dec 03 root@fogclient ~]# if [[ $label =~ [Rr][Ee$E$e][Ss] ]]; then echo "JA"; fi
What?!? I simply added `[Ss]`, which should match, shouldn’t it? OK, let’s try to skip the special character for now.
[Mon Dec 03 root@fogclient ~]# if [[ $label =~ [Rr].[Ss] ]]; then echo "JA"; fi
[Mon Dec 03 root@fogclient ~]# if [[ $label =~ [Rr]..[Ss] ]]; then echo "JA"; fi
JA
[Mon Dec 03 root@fogclient ~]# if [[ $label =~ [Rr][Ee$E$e].[Ss] ]]; then echo "JA"; fi
JA
Crazy stuff. So this special character ends up being two characters when doing a bash regex match. I still have no idea what that extra character might be or how to match it other than using `.` for any character. I guess it stems from our buildroot bash only partially supporting UTF-8. Anyhow, this is what my new regex looks like:
[Mon Dec 03 root@fogclient ~]# if [[ $label =~ [Rr][Ee$E$e].[Ss][Ee][Rr][Vv][Ee$E$e].[Dd]? ]]; then echo "JA"; fi
JA
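A likely candidate for the extra character, offered as an assumption that was not verified on the FOS image: if no UTF-8 locale is active, the regex engine works byte by byte, so é counts as the two bytes 0xC3 0xA9. A bracket expression then consumes only 0xC3, and the `.` swallows 0xA9. The sketch below imitates that situation by forcing `LC_ALL=C` and matching with `awk`; `bytewise_match` is a hypothetical helper name.

```shell
#!/bin/sh
label=$(printf 'R\303\251serv\303\251_au_syst')   # "Réservé_au_syst" as UTF-8 bytes

# Print 1 if the label matches the regex when every byte counts as one
# character (forced with LC_ALL=C), 0 otherwise.
bytewise_match() {
    printf '%s\n' "$label" | LC_ALL=C awk -v pat="$1" '$0 ~ pat { hit = 1 } END { print hit + 0 }'
}

echo "[Rr].[Ss]  -> $(bytewise_match '[Rr].[Ss]')"
echo "[Rr]..[Ss] -> $(bytewise_match '[Rr]..[Ss]')"
```

With byte-wise matching, one `.` between R and s is not enough but two are, which is the same pattern as in the bash tests above.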
And exactly the same if we use `grep` instead:
[Mon Dec 03 root@fogclient ~]# echo $label | grep "[Rr][Ee$E$e][Ss][Ee][Rr][Vv][Ee$E$e]"
[Mon Dec 03 root@fogclient ~]# echo $label | grep "[Rr][Ee$E$e].[Ss][Ee][Rr][Vv][Ee$E$e].[Dd]*"
Réservé_au_syst
Same ugly hack, I reckon. And please keep in mind that this could fail if some Windows installations were made using ISO-8859-1 code pages. So to sum it all up: let’s move forward and not waste any more time trying to find the perfect regex to match all the labels out there.
We have started to gather information on that stuff and I think we should tackle it now and see if it works any better: https://github.com/FOGProject/fos/issues/18
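For anyone who wants to reproduce the ISO-8859-1 caveat mentioned above, here is a rough sketch. It assumes byte-wise matching (forced with `LC_ALL=C`), the ISO-8859-1 label is built by hand rather than read from a real disk, and `hack_matches` is a made-up helper name.

```shell
#!/bin/sh
e=$(printf '\303\251')   # é as UTF-8 bytes (0xC3 0xA9)
E=$(printf '\303\211')   # É as UTF-8 bytes (0xC3 0x89)
pattern="[Rr][Ee$E$e].[Ss][Ee][Rr][Vv][Ee$E$e].[Dd]*"

label_utf8=$(printf 'R\303\251serv\303\251_au_syst')   # label encoded as UTF-8
label_latin1=$(printf 'R\351serv\351_au_syst')         # same text in ISO-8859-1

# Print 1 if the given label matches the workaround pattern, 0 otherwise.
# "|| true" keeps grep's non-zero exit status (no match) from aborting a
# script that runs with "set -e".
hack_matches() {
    printf '%s\n' "$1" | LC_ALL=C grep -c -- "$pattern" || true
}

echo "UTF-8 label:      $(hack_matches "$label_utf8")"
echo "ISO-8859-1 label: $(hack_matches "$label_latin1")"
```

The workaround pattern finds the UTF-8 label but misses the ISO-8859-1 one, because the byte 0xE9 appears in neither bracket expression.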