2 Dec 2009 05:12
Re: HTML float style
Professional Software Engineering <PSE-L <at> mail.professional.org>
2009-12-02 04:12:29 GMT
At 21:01 2009-12-01 -0500, Eric Wood wrote:
>I've been running a similar rule for years.  Sometimes a MS Office document
>would get caught but that's rare for me.
>## Check for emails which have many float: and single char divs
>:0
>* -11^0
>* 1^1 ()(float |div> . <)

I'm fairly certain you missed a B flag or a B ?? in the condition.  I would think you would want to ensure the float was part of a style (sort of like how I'm doing it), and the div should be inclusive of the open tag.  Your spaces, BTW, will be interpreted as mandatory spaces in there.

* 1^1 B ?? ()<div[^>]*>[ ]?.[ ]?<

That excluded character class allows the div to contain optional attributes, such as a style, class, or id.  Eliminate the bracketed space+tab if they're actually not wanted.

I'll set that up in an analysis filter to watch for how often it matches new mail, but I threw a sizeable spam corpus at the above div recipe (sans float), and I only got single event matches on a few messages (i.e. not enough to overcome the negative prep).  I'm suspecting the single character div doesn't occur enough to [...] got *ZERO* hits on it.

Oh, and I ran it to count hits on your recipe (sans float) and my revision, and yours doesn't match anything -- because the divs that were matching actually had only a single character between them and the next tag (which wasn't a div closure, BTW):

        <div align=center> <a href="

Also worth noting, though it would be virtually impossible to check for reliably in a procmail recipe, is that the divs on those messages were not balanced - there were more opening divs than closing ones.  Spammers can't even craft HTML with proper syntax.  Whatever has become of the work ethic?

A side observation is that, with only a couple of exceptions, all of the messages that did have any events had the same basic subject line involving pharma and a varying percentage off.  This wasn't merely a scatter of hits for one or two days either.
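
For anyone who wants to repeat the hit-count comparison outside procmail, here is a rough Python sketch of the two div conditions and the weighted score.  It is only an approximation: Python's re module is not procmail's regex engine, the names DIV_RE, ORIGINAL_RE, score and would_match are made up for illustration, and the tab inside the bracketed classes is an assumption taken from the "space+tab" remark above.  The scoring assumes the usual reading of "-11^0" plus "1^1", i.e. start at -11, add one per match, and fire only when the total is positive.

```python
import re

# Approximation of the revised condition:
#   * 1^1 B ?? ()<div[^>]*>[ \t]?.[ \t]?<
# procmail matches case-insensitively by default, hence IGNORECASE.
DIV_RE = re.compile(r"<div[^>]*>[ \t]?.[ \t]?<", re.IGNORECASE)

# The div half of the original condition, with its spaces taken
# literally (mandatory space, any character, mandatory space).
ORIGINAL_RE = re.compile(r"div> . <", re.IGNORECASE)


def score(body, pattern, base=-11):
    """Return base plus one point per non-overlapping match in body."""
    return base + len(pattern.findall(body))


def would_match(body, pattern, base=-11):
    """A weighted recipe only fires when the final score is above zero."""
    return score(body, pattern, base) > 0


# The kind of fragment that showed up in the spam corpus: an opening
# div with an attribute, one character, then another tag.
SAMPLE = '<div align=center> <a href="x">'
```

With the default base of -11, a body needs at least twelve hits before would_match() returns True, which is why a handful of single-event matches never tripped the recipe.  The sample fragment is picked up by DIV_RE but not by ORIGINAL_RE, because the latter insists on a bare "div>" followed by space, character, space.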

I was recently experimenting with something to try to weigh how many short words (really, letter jumbles) there were in a message as compared to longer ones.  I've seen a certain amount of spew which has in the text portion a lot of 2-4 character jumbles in a paragraph, with very few longer jumbles.  However, they tended to be the text portion of a multipart which included an HTML portion which, surprise, used float...

---
 Sean B. Straw / Professional Software Engineering

 Procmail disclaimer: <http://www.professional.org/procmail/disclaimer.html>
 Please DO NOT carbon me on list replies.  I'll get my copy from the list.