Testing for Russian spam

xyzzy · 14 Jan 2019, 05:40 AM

I got a Russian spam where the From name was in Cyrillic and it's slipping under my spam threshold. Since I'm a bit "obsessed" with sieve these days I thought I would write a test to check for Russian names. The following test condition works:

Code:

header :regex "From" "(^|,)[[:space:]]*\"?[^<]*[а-яА-ЯЁё]+[^\"<]*\"?[[:space:]]*<"

And so does this

Code:

header :regex "From" "(^|,)[[:space:]]*\"?[^<]*[аАбБвВгГдДеЕёЁжЖзЗиИйЙкКлЛмМнНоОпПрРсСтТуУфФхХцЦчЧшШщЩъЪыЫьЬэЭюЮяЯ]+[^\"<]*\"?[[:space:]]*<"

I found these by a little bit of google searching. It's checking the From name for one or more Cyrillic characters.

The actual full Unicode range for Cyrillic is 0x400-0x4ff and I would much rather specify this hex range in the regex but I cannot figure how to do it.

RFC 5228, section 2.4.2.4 indicates I can write hex numbers like ${hex:400} or Unicode as ${unicode:400}. But if I attempt to write a regex range like [${unicode:400}-${unicode:4ff}] I get a syntax error. So is there some syntax that allows a hex or Unicode number range in a sieve regex, or am I "stuck" checking the explicit characters?

gardenweed · 14 Jan 2019, 06:22 AM

Looks interesting.
Can you explain the code that follows the "From"?

xyzzy · 14 Jan 2019, 09:07 AM

If I were to use the UI to generate an organize rule of the form:

The senderʼs name matches glob pattern abc

then the sieve code FM generates for the test is,

Code:

header :regex "From" "(^|,)[[:space:]]*\"?abc\"?[[:space:]]*<"

it doesn't actually generate a glob match but a regex instead because the actual From header looks something like,

From: abc <foo@domain.tlc>

The pattern for a header test is matched against the text after the colon. So, starting after From:, the pattern skips leading spaces (or spaces after a comma), then matches the name you are looking for, optionally enclosed in quotes (used when the name has spaces of its own), followed by any number of spaces before the '<' that starts the email address.
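
To make the pieces concrete, here is the same generated test with each part of the pattern annotated. This is only a sketch: the require line and the fileinto action with its folder name "Test" are assumptions added for illustration, not something FM generated.

```sieve
require ["regex", "fileinto"];

# Pattern pieces, left to right:
#   (^|,)          start of the header value, or a comma between addresses
#   [[:space:]]*   optional whitespace
#   \"?            optional opening quote (escaped inside a Sieve string)
#   abc            the sender name to look for
#   \"?            optional closing quote
#   [[:space:]]*<  optional whitespace, then the '<' that opens the address
if header :regex "From" "(^|,)[[:space:]]*\"?abc\"?[[:space:]]*<" {
    fileinto "Test";  # hypothetical folder name
}
```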

Apparently there must be cases where more than one name can be specified, separated by commas. It's the only reason I can think of for looking for commas. Using the UI organize rules avoids a lot of mistakes and saves time constructing these things. I always keep a disabled organize rule lying around just for this purpose.

I took this pattern and replaced the abc portion with a pattern match for one or more (the + sign) Cyrillic characters. In other words [а-яА-ЯЁё]+ or all those characters enumerated.

I would prefer to use the entire Unicode range of Unicode Cyrillic possibilities, i.e., 0x400 to 0x4ff, so that was the reason for my question. I don't know if this is even syntactically possible in sieve regex. Certainly what I tried so far isn't. I posted here hoping someone might know the magic syntax that works if any.

Update:
As I was writing that last paragraph I started thinking about whether there are actually Unicode characters across that entire range for Cyrillic. Looking at a Unicode table for Cyrillic (here) I discovered that there are. So [Ѐ-ӿ]+ should work. Not sure why the web pages I found by searching Google didn't show that. Maybe because others showed the hex range instead. I still want to know if I can do that.

gardenweed · 14 Jan 2019, 11:02 AM

Thanks for sharing the methodology.

BritTim · 14 Jan 2019, 04:45 PM

Did you include

Code:

require "encoded-character";

at the top of your script?

xyzzy · 14 Jan 2019, 05:39 PM

Thank you BritTim. Good catch!

That was the missing piece of the puzzle. I didn't even notice that line in the example in the spec or the comment a little above about the require encoded-character.

With it added this is the range format that appears to work (testing all this with Sieve Tester).

Code:

[${unicode:400}-${unicode:4ff}]+

So I now understand that and in addition learned two other things along the way. First the require command needs to be before any other statements, i.e., at the top as you pointed out. Second, I thought that the require extension list generated by FM was all the sieve extensions FM supported and you couldn't use any others (like encoded-character). Obviously that isn't correct.
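
For reference, a minimal sketch of a complete script using that range might look like the following. The fileinto action, the "Junk" folder name and the stop are assumptions for illustration. The require list also assumes the server offers the regex extension, which FM's own generated rules rely on.

```sieve
# All require commands must come before any other command.
require ["regex", "encoded-character", "fileinto"];

# The range below covers the Cyrillic block, U+0400 through U+04FF.
# encoded-character expands the escapes inside the string before the
# pattern is used as a regex.
if header :regex "From" "(^|,)[[:space:]]*\"?[^<]*[${unicode:400}-${unicode:4ff}]+[^\"<]*\"?[[:space:]]*<" {
    fileinto "Junk";  # hypothetical folder; use whatever action you prefer
    stop;
}
```

Note that this depends on the regex engine treating the range as Unicode characters rather than raw bytes. It appeared to work in Sieve Tester as described above, but other implementations may behave differently.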

Another reason why I assumed you couldn't add others was because Sieve Tester keeps erroring out the fcc extension since it is not implemented in Sieve Tester itself. I wish FM would fix that since I always need to delete that fcc when I copy/paste my script into there. Yes, I submitted a ticket on it some time ago. Sieve Tester is obviously not very high priority since they think not many users actually write Sieve stuff. They're probably right too.

I think the reason Sieve Tester errors out the require fcc but not encoded-character is that encoded-character is part of the base Sieve standard (RFC5228) and I guess must be implemented where fcc is not in the base standard.

Again thanks.
