EmailDiscussions.com  

Old 14 Jan 2019, 05:40 AM   #1
xyzzy
Senior Member
 
Join Date: May 2018
Posts: 146
Testing for Russian spam

I got a Russian spam message where the From name was in Cyrillic, and it slipped under my spam threshold. Since I'm a bit "obsessed" with Sieve these days, I thought I would write a test to check for Russian names. The following test condition works:
Code:
header :regex "From" "(^|,)[[:space:]]*\"?[^<]*[а-яА-ЯЁё]+[^\"<]*\"?[[:space:]]*<"
And so does this:

Code:
header :regex "From" "(^|,)[[:space:]]*\"?[^<]*[аАбБвВгГдДеЕёЁжЖзЗиИйЙкКлЛмМнНоОпПрРсСтТуУфФхХцЦчЧшШщЩъЪыЫьЬэЭюЮяЯ]+[^\"<]*\"?[[:space:]]*<"
I found these with a little Google searching. Each checks the From name for one or more Cyrillic characters.

The actual full Unicode range for Cyrillic is 0x400-0x4ff and I would much rather specify this hex range in the regex, but I cannot figure out how to do it.

RFC5228, section 2.4.2.4 indicates I can write hex numbers like ${hex:400} or Unicode as ${unicode:400}. But if I attempt to write a regex range like [${unicode:400}-${unicode:4ff}] I get a syntax error. So is there some syntax that allows a hex or Unicode number range in a Sieve regex, or am I "stuck" checking the explicit characters?

Old 14 Jan 2019, 06:22 AM   #2
gardenweed
Cornerstone of the Community
 
Join Date: Jun 2008
Location: Perth
Posts: 531
Looks interesting.
Can you explain the code that follows the "From"?
Old 14 Jan 2019, 09:07 AM   #3
xyzzy
Senior Member
 
Join Date: May 2018
Posts: 146
If I were to use the UI to generate an organize rule of the form:

The senderʼs name matches glob pattern abc

then the Sieve code FM generates for the test is:

Code:
header :regex "From" "(^|,)[[:space:]]*\"?abc\"?[[:space:]]*<"
It doesn't actually generate a glob match but a regex instead, because the actual From header looks something like:

From: abc <[email protected]>

The pattern match for a Sieve header test starts after the colon. So, starting after the "From:", the pattern skips leading spaces (or spaces after a comma), then matches the name to look for, optionally enclosed in quotes (for names that contain spaces of their own), followed by any number of spaces before the email address, which starts after the '<'.
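
As a rough sketch, the pieces of that generated pattern line up as in the comments below. The keep action is only a placeholder I added so the snippet is a complete script; it is not something FM generates, and the require line assumes the regex extension is available.

```sieve
require ["regex"];

# Pattern pieces, left to right:
#   (^|,)          start of the header value, or a comma between addresses
#   [[:space:]]*   optional whitespace
#   \"?            optional opening quote (written \" inside a Sieve string)
#   abc            the sender name being looked for
#   \"?            optional closing quote
#   [[:space:]]*   optional whitespace before the address
#   <              the '<' that opens the email address
if header :regex "From" "(^|,)[[:space:]]*\"?abc\"?[[:space:]]*<" {
    keep;   # placeholder action for illustration only
}
```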

Apparently there must be cases where more than one name can be specified, separated by commas. It's the only reason I can think of for looking for commas. Using the UI organize rules avoids a lot of mistakes and saves time constructing these things. I always keep a disabled organize rule lying around just for this purpose.

I took this pattern and replaced the abc portion with a pattern match for one or more (the + sign) Cyrillic characters. In other words [а-яА-ЯЁё]+ or all those characters enumerated.

I would prefer to use the entire Unicode range for Cyrillic, i.e., 0x400 to 0x4ff, so that was the reason for my question. I don't know if this is even syntactically possible in a Sieve regex. Certainly what I tried so far isn't. I posted here hoping someone might know the magic syntax that works, if any.

Update:
As I was writing that last paragraph I started thinking about whether there are actually Unicode characters across that entire range for Cyrillic. Looking at a Unicode table for Cyrillic (here) I discovered that there are. So [Ѐ-ӿ]+ should work. Not sure why the web pages I found by Google searching didn't show that. Maybe because others showed the hex range instead. I still want to know if I can do that.

Last edited by xyzzy : 14 Jan 2019 at 09:41 AM.
Old 14 Jan 2019, 11:02 AM   #4
gardenweed
Cornerstone of the Community
 
Join Date: Jun 2008
Location: Perth
Posts: 531
Thanks for sharing the methodology.
Old 14 Jan 2019, 04:45 PM   #5
BritTim
The "e" in e-mail
 
Join Date: May 2003
Location: mostly in Thailand
Posts: 2,781
Did you include
Code:
require "encoded-character";
at the top of your script?
Old 14 Jan 2019, 05:39 PM   #6
xyzzy
Senior Member
 
Join Date: May 2018
Posts: 146
Thank you BritTim. Good catch! That was the missing piece of the puzzle. I didn't even notice that line in the example in the spec or the comment a little above about the require encoded-character. With it added this is the range format that appears to work (testing all this with Sieve Tester).

Code:
[${unicode:400}-${unicode:4ff}]+
So I now understand that, and in addition I learned two other things along the way. First, the require command needs to come before any other statements, i.e., at the top as you pointed out. Second, I thought that the require extension list generated by FM was the complete list of Sieve extensions FM supported and that you couldn't use any others (like encoded-character). Obviously that isn't correct.
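
Putting it together, a minimal sketch of a complete rule might look like the following. The folder name "Junk Mail" and the fileinto action are my own assumptions for illustration; in a real FM script the extra extension names would be added to the existing require list instead.

```sieve
# require must come before any other command in the script.
require ["regex", "encoded-character", "fileinto"];

# With encoded-character enabled, ${unicode:400}-${unicode:4ff}
# expands to the character range U+0400 through U+04FF (Cyrillic).
if header :regex "From" "(^|,)[[:space:]]*\"?[^<]*[${unicode:400}-${unicode:4ff}]+[^\"<]*\"?[[:space:]]*<" {
    fileinto "Junk Mail";   # hypothetical folder name
}
```

Without "encoded-character" in the require list, the ${unicode:...} sequences are not expanded, which would explain the syntax error I was getting earlier.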

Another reason why I assumed you couldn't add others was that Sieve Tester keeps rejecting the fcc extension, since it is not implemented in Sieve Tester itself. I wish FM would fix that, since I always need to delete that fcc when I copy/paste my script into there. Yes, I submitted a ticket on it some time ago. Sieve Tester is obviously not very high priority since they think not many users actually write Sieve stuff. They're probably right too.

I think the reason Sieve Tester rejects the require for fcc but not for encoded-character is that encoded-character is defined in the base Sieve standard (RFC5228), so I guess implementations tend to include it, whereas fcc is not in the base standard.

Again thanks.
All times are GMT +9. The time now is 10:46 PM.

 

Copyright EmailDiscussions.com 1998-2013. All Rights Reserved. Privacy Policy