...making Linux just a little more fun!
Sam Bisbee [sbisbee at computervip.com]
Hey gang,
Here's the deal: I'm trying to delete a message from an mbox with Bash. I have the message number that I got by filtering with `frm` (the message is identified by a header that holds a unique SHA crypt). You've probably guessed by now, but mailutils is fair game.
I don't want to convert from mbox to maildir in /tmp on each run, because it's reasonable that the script would be run every minute. Also, I don't want to put users through that pain with a large mbox.
Also, I really don't want to write a "delete by message number" program in C using the libmailutils library, but I will resort to it if needed.
I saw http://www.argon.org/~roderick/mbox-purge.html, but would like to have "common" packages as dependencies only.
Is there some arg that I missed in `mail`? Should I just try and roll mbox-purge in? All ideas, tricks and release management included, are welcome.
Cheers,
-- Sam Bisbee
Thomas Adam [thomas.adam22 at gmail.com]
On Fri, Feb 05, 2010 at 07:28:52PM -0500, Sam Bisbee wrote:
> Is there some arg that I missed in `mail`? Should I just try and roll
> mbox-purge in? All ideas, tricks and release management included, are welcome.
http://www.unix.com/unix-dummies-questio[...]-delete-all-email-messages-one-time.html
Looks promising.
You might also be able to use xmh here as well, I have a book on this somewhere.
-- Thomas Adam
-- "It was the cruelest game I've ever played and it's played inside my head." -- "Hush The Warmth", Gorky's Zygotic Mynci.
Ben Okopnik [ben at linuxgazette.net]
On Fri, Feb 05, 2010 at 07:28:52PM -0500, Samuel Bisbee-vonKaufmann wrote:
> Hey gang,
>
> Here's the deal: I'm trying to delete a message from an mbox with Bash. I have
> the message number that I got by filtering with `frm` (the message is
> identified by a header that holds a unique SHA crypt). You've probably guessed
> by now, but mailutils is fair game.
>
> I don't want to convert from mbox to maildir in /tmp on each run, because it's
> reasonable that the script would be run every minute. Also, I don't want to put
> users through that pain with a large mbox.
>
> Also, I really don't want to write a "delete by message number" program in C
> using the libmailutils library, but I will resort to it if needed.
'formail -s procmail' would be the classic tools for the job, but it sounds like you need something a bit more flexible than procmail (i.e., something that will take an argument and then reject a given message based on that.) Like so (we'll call this script 'reject'):
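#!/bin/bash
# Created by Ben Okopnik on Fri Feb 5 21:11:54 EST 2010

[ -z "$1" ] && { printf 'Usage: %s <arg_to_reject>\n' "${0##*/}"; exit 1; }

# 'mktemp' stands in here for the Debian-specific 'tempfile'
tmp=$(mktemp) || exit 1
# This will read from STDIN
cat > "$tmp"
# Pass the message through unless it matches the argument
grep -q -- "$1" "$tmp" || cat "$tmp"
rm -f "$tmp"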
All we need to do then is -
formail -s ./reject '^From: *Joe Smith' < mbox > mbox.out
This should produce an 'mbox.out' that contains all the messages in 'mbox' except those from Joe Smith.
-- * Ben Okopnik * Editor-in-Chief, Linux Gazette * http://LinuxGazette.NET *
Ben Okopnik [ben at linuxgazette.net]
On Fri, Feb 05, 2010 at 09:22:52PM -0500, Benjamin Okopnik wrote:
>
> All we need to do then is -
>
> formail -s ./reject '^From: *Joe Smith' < mbox > mbox.out
>
> This should produce an 'mbox.out' that contains all the messages in
> 'mbox' except those from Joe Smith.
I do have to note, though, that this isn't great for large mailboxes with lots of messages; it's not the fastest thing in the world. As a baseline, it takes about 3 seconds to process a 10MB mailbox that has 36 messages in it, but it takes 22 seconds to process one of the same size but with ~600 messages in it. I suppose you could speed it up by sticking the tempfile into memory (assuming you have enough memory), but you're still spawning some interpreter or parser ~600 times, and that ain't cheap.
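[ One way to avoid the per-message spawn is to do the split and the match inside a single shell process. What follows is a minimal sketch and was not posted in the thread: `mbox_drop` is a made-up name, it assumes every message starts with a 'From ' line and that body lines beginning with 'From ' are already escaped, and unlike 'reject' it only looks at header lines rather than the whole message. A pure-Bash read loop is slow in its own right on big files, so treat it as an illustration of the idea rather than a cure. ]

```shell
#!/bin/bash
# Hypothetical helper (not from the thread): copy an mbox from STDIN to STDOUT,
# dropping every message that has a header line matching the ERE in $1.
mbox_drop() {
    local pattern=$1 line msg='' in_headers=0 drop=0
    while IFS= read -r line || [ -n "$line" ]; do
        if [[ $line == "From "* ]]; then
            # A new message begins: flush the previous one unless it matched.
            [ "$drop" -eq 0 ] && printf '%s' "$msg"
            msg='' in_headers=1 drop=0
        elif [ "$in_headers" -eq 1 ]; then
            if [ -z "$line" ]; then
                in_headers=0        # the first blank line ends the headers
            elif [[ $line =~ $pattern ]]; then
                drop=1
            fi
        fi
        msg+=$line$'\n'
    done
    # Flush the final message.
    [ "$drop" -eq 0 ] && printf '%s' "$msg"
    return 0
}
```

It would be called much like the formail pipeline: `mbox_drop '^From: *Joe Smith' < mbox > mbox.out`.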
If there was no requirement to do it with Bash (or, more precisely, mailutils), I would - of course - do it all in Perl, which is famous for its text-processing capabilities. Something like this:
#!/usr/bin/perl -w
# Created by Ben Okopnik on Thu Jan 14 21:55:46 EST 2010
use strict;
use Mail::MboxParser;
$|++;

die $0 =~ /([^\/]+)$/, " <mbox> <msgid_value>\n" unless @ARGV == 2;

my $mb = Mail::MboxParser->new($ARGV[0]);

while (my $msg = $mb->next_message) {
    my $s = $msg->header->{subject};
    print "$msg\n\n" unless defined $s && $s =~ /$ARGV[1]/;
}
Less than 2.1 seconds for a 10MB, ~600 message box.
-- * Ben Okopnik * Editor-in-Chief, Linux Gazette * http://LinuxGazette.NET *
Sam Bisbee [sbisbee at computervip.com]
On Fri, Feb 05, 2010 at 11:01:37PM -0500, Ben Okopnik wrote:
> On Fri, Feb 05, 2010 at 09:22:52PM -0500, Benjamin Okopnik wrote:
> >
> > All we need to do then is -
> >
> > formail -s ./reject '^From: *Joe Smith' < mbox > mbox.out
> >
> > This should produce an 'mbox.out' that contains all the messages in
> > 'mbox' except those from Joe Smith.
>
> I do have to note, though, that this isn't great for large mailboxes
> with lots of messages; it's not the fastest thing in the world. As a
> baseline, it takes about 3 seconds to process a 10MB mailbox that has 36
> messages in it, but it takes 22 seconds to process one of the same size
> but with ~600 messages in it. I suppose you could speed it up by
> sticking the tempfile into memory (assuming you have enough memory), but
> you're still spawning some interpreter or parser ~600 times, and that
> ain't cheap.
Yeah, coding against "keep everything on one gigantic file" isn't very fun, though it makes administration a lot easier.
> If there was no requirement to do it with Bash (or, more precisely,
> mailutils), I would - of course - do it all in Perl, which is famous for
> its text-processing capabilities. Something like this:
I was trying to keep my program all in one language, but the Bash solution you provided simply choked and died with large mboxes (ex., currently mine is 365 megs). So, with that and needing to match more than one header pair, I give you this: http://github.com/ravidgemole/mailp/blob/master/deleteMessage.plx
[Don't worry, check the THANKS file to see that you're not forgotten.]
The tests showed much better results. FYI, this test was with a non-sane e-mail so I could make sure the program was doing AND matching, not OR.
sbisbee@orbital:~/src/mailp$ time ./deleteMessage.plx ./mbox to ".*sbisbee@computervip\.c0om.*" x-mailp ad9d8e35e69f9547a9b3c4a8fb06ad0edbe56d9b > test

real 0m23.135s
user 0m20.065s
sys 0m2.760s
Now my main program (a Bash script, though I may convert to Perl for homogeneousness) can remove messages from the mbox that have a specific To address and a certain header key/value pair.
Some more things I want to add:
- An arg to run through the mbox file in reverse, with the theory that people will often want to deal with recent e-mails at the end of the file instead of old ones. Ex., my program would run this command _a lot_ faster if it could combine this arg with the next one...
- An arg to stop running through the mbox file when one match is found. Haven't played with Mail::MboxParser enough yet to know whether I can tell it to just dump the rest of the file's contents.
Thanks,
-- Sam Bisbee
Ben Okopnik [ben at linuxgazette.net]
On Fri, Feb 12, 2010 at 11:48:28PM -0500, Samuel Bisbee-vonKaufmann wrote:
> On Fri, Feb 05, 2010 at 11:01:37PM -0500, Ben Okopnik wrote:
> >
> > I do have to note, though, that this isn't great for large mailboxes
> > with lots of messages; it's not the fastest thing in the world. As a
> > baseline, it takes about 3 seconds to process a 10MB mailbox that has 36
> > messages in it, but it takes 22 seconds to process one of the same size
> > but with ~600 messages in it. I suppose you could speed it up by
> > sticking the tempfile into memory (assuming you have enough memory), but
> > you're still spawning some interpreter or parser ~600 times, and that
> > ain't cheap.
>
> Yeah, coding against "keep everything on one gigantic file" isn't very fun,
> though it makes administration a lot easier.
But it makes processing a lot slower - and it's an asymptotic curve. In my experience/best judgement, whenever you expose a static data source to multiple users, anything over a meg or so in size is a disaster waiting to happen. At that point, either a database or some sort of a pointer-based index scheme is a requirement.
> I was trying to keep my program all in one language, but the Bash solution you
> provided simply choked and died with large mboxes (ex., currently mine is 365
> megs).
Sam, y'know how I said "anything over a couple of meg"? I think 365MB sorta, um, qualifies.
If you just wanted to select and return various emails, there's a bunch of stuff that allows you to do that (e.g., mairix and hyperestraier are stunningly good at what they do.) However, you actually want to delete stuff... in my mind, that pretty much defines it as either a database or a customized caching and indexing solution.
> So, with that and needing to match more than one header pair, I give you
> this: http://github.com/ravidgemole/mailp/blob/master/deleteMessage.plx
>
> [Don't worry, check the THANKS file to see that you're not forgotten.]
>
> The tests showed much better results. FYI, this test was with a non-sane e-mail
> so I could make sure the program was doing AND matching, not OR.
Sure. Do note that Mail::MboxParser allows you to create an index file: take a look at the 'make_index' option in the docs.
> sbisbee@orbital:~/src/mailp$ time ./deleteMessage.plx ./mbox to ".*sbisbee@computervip\.c0om.*" x-mailp ad9d8e35e69f9547a9b3c4a8fb06ad0edbe56d9b > test
>
> real 0m23.135s
> user 0m20.065s
> sys 0m2.760s
That's certainly a machine with lots more horsepower than my little netbook - and with lots more memory. In any case, you could speed it up significantly with an index.
> Now my main program (a Bash script, though I may convert to Perl for
> homogeneousness) can remove messages from the mbox that have a specific To
> address and a certain header key/value pair.
>
> Some more things I want to add:
>
> - An arg to run through the mbox file in reverse, with the theory that people
>   will often want to deal with recent e-mails at the end of the file instead
>   of old ones. Ex., my program would run this command _a lot_ faster if it
>   could combine this arg with the next one...
If you invert your index, this would be automatic.
> - An arg to stop running through the mbox file when one match is found.
>   Haven't played with Mail::MboxParser enough yet to know whether I can tell
>   it to just dump the rest of the file's contents.
Wouldn't be a problem. The nuclear-powered mechanical dwarves beneath the surface of this module will do the right thing if you only ask them.
-- * Ben Okopnik * Editor-in-Chief, Linux Gazette * http://LinuxGazette.NET *
Sam Bisbee [sbisbee at computervip.com]
On Mon, Feb 15, 2010 at 07:21:47PM -0500, Ben Okopnik wrote:
> On Fri, Feb 12, 2010 at 11:48:28PM -0500, Samuel Bisbee-vonKaufmann wrote:
> > On Fri, Feb 05, 2010 at 11:01:37PM -0500, Ben Okopnik wrote:
> > >
> > > I do have to note, though, that this isn't great for large mailboxes
> > > with lots of messages; it's not the fastest thing in the world. As a
> > > baseline, it takes about 3 seconds to process a 10MB mailbox that has 36
> > > messages in it, but it takes 22 seconds to process one of the same size
> > > but with ~600 messages in it. I suppose you could speed it up by
> > > sticking the tempfile into memory (assuming you have enough memory), but
> > > you're still spawning some interpreter or parser ~600 times, and that
> > > ain't cheap.
> >
> > Yeah, coding against "keep everything on one gigantic file" isn't very fun,
> > though it makes administration a lot easier.
>
> But it makes processing a lot slower - and it's an asymptotic curve.
Thanks for re-stating the first half of my point, but with fancier jargon. :-P
> In my experience/best judgement, whenever you expose a static data source to
> multiple users, anything over a meg or so in size is a disaster waiting to
> happen. At that point, either a database or some sort of a pointer-based
> index scheme is a requirement.
Well, I don't know about the 1 meg part, but I generally agree with you. I really don't want to go the index/database route with this program - though it does give me an idea for another project. I'll most likely recommend that people have fetchmail drop their mailp mail into a stand alone mbox file, set up a different e-mail, etc. if they're hitting performance issues.
Indexes and databases are great, but they'll quickly increase the complexity of my program. Also, adding dependencies will make it harder to port this software; yes, things like [database/caching/index software of choice] are often available, but I have dealt with a lot of hosting environments where it's a political process to get anything new added. Because this is monitoring software, I'm trying to cut down on the number of dependencies so that it can be easily integrated into any environment, whether it's your home desktop or a HIPAA compliant system. This was another reason I wanted to stay away from Perl at the beginning, as the shell is, uh, something of a requirement (side note, will probably de-Bashism mailp once it's done to make it Bourne compliant).
Also, I'm trying to keep this thing all Unix philosophy like.
> > I was trying to keep my program all in one language, but the Bash solution you
> > provided simply choked and died with large mboxes (ex., currently mine is 365
> > megs).
>
> Sam, y'know how I said "anything over a couple of meg"? I think 365MB
> sorta, um, qualifies.
Nah. :-P
> If you just wanted to select and return various emails, there's a bunch
> of stuff that allows you to do that (e.g., mairix and hyperestraier are
> stunningly good at what they do.) However, you actually want to delete
> stuff... in my mind, that pretty much defines it as either a database or
> a customized caching and indexing solution.
I'll take a look at them regardless, thanks. And yes, I do want to delete for usability reasons (not leaving cruft in folks' mbox files).
> > So, with that and needing to match more than one header pair, I give you
> > this: http://github.com/ravidgemole/mailp/blob/master/deleteMessage.plx
> >
> > [Don't worry, check the THANKS file to see that you're not forgotten.]
> >
> > The tests showed much better results. FYI, this test was with a non-sane e-mail
> > so I could make sure the program was doing AND matching, not OR.
>
> Sure. Do note that Mail::MboxParser allows you to create an index file:
> take a look at the 'make_index' option in the docs.
Ohhh thanks. Will drop this into my "low hanging fruit" category when I start to do "real" benchmarking.
> > sbisbee@orbital:~/src/mailp$ time ./deleteMessage.plx ./mbox to ".*sbisbee@computervip\.c0om.*" x-mailp ad9d8e35e69f9547a9b3c4a8fb06ad0edbe56d9b > test
> >
> > real 0m23.135s
> > user 0m20.065s
> > sys 0m2.760s
>
> That's certainly a machine with lots more horsepower than my little
> netbook - and with lots more memory.
Yup. 4 gigs RAM and Intel Core 2 Quad 2.4 GHz. Also has mirrored RAID (actual card, not on board) with hard drives with big caches, so the disk I/O portion of that command is "okay" (numbers weren't very different when I just dumped to terminal instead of redirecting).
My work tends to be process intensive, so I had to go "all in" last winter. Side benefit, it plays games like a pro. ;-)
> In any case, you could speed it up significantly with an index.
>
> > Now my main program (a Bash script, though I may convert to Perl for
> > homogeneousness) can remove messages from the mbox that have a specific To
> > address and a certain header key/value pair.
> >
> > Some more things I want to add:
> >
> > - An arg to run through the mbox file in reverse, with the theory that people
> >   will often want to deal with recent e-mails at the end of the file instead
> >   of old ones. Ex., my program would run this command _a lot_ faster if it
> >   could combine this arg with the next one...
>
> If you invert your index, this would be automatic.
I was thinking of using MboxParser's rewind feature, but yeah.
> > - An arg to stop running through the mbox file when one match is found.
> >   Haven't played with Mail::MboxParser enough yet to know whether I can tell
> >   it to just dump the rest of the file's contents.
>
> Wouldn't be a problem. The nuclear-powered mechanical dwarves beneath
> the surface of this module will do the right thing if you only ask them.
Hope they aren't unionized!
But these two args would speed things up a lot, especially if I store when I sent the e-mail:
1. Start processing the mbox from bottom to top.
2. Stop processing when...
   2a. Found the e-mail and delete it.
   2b. Found an e-mail with a time stamp that's older than when we sent the e-mail.
3. Dump the rest of the e-mail into the mbox.
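[ The steps above could be sketched in shell roughly as follows. This is an illustration under stated assumptions, not the code in mailp: the function name `mbox_delete_last_match` is made up, it relies on Bash 4 (`mapfile`) and on `grep -b` reporting byte offsets, it assumes 'From ' lines inside bodies are escaped, and it leaves out step 2b (the time stamp cutoff). Note that `grep` still reads the whole file once to find the message boundaries; the saving is that only the newest messages get their headers examined. ]

```shell
#!/bin/bash
# Hypothetical sketch: write mbox file $1 to STDOUT minus the newest message
# whose headers match the ERE in $2. Returns 1 if no message matched.
mbox_delete_last_match() {
    local mbox=$1 pattern=$2 size n i start end
    local -a starts
    # Byte offset of every "From " separator line, in file order.
    mapfile -t starts < <(grep -a -b '^From ' "$mbox" | cut -d: -f1)
    size=$(wc -c < "$mbox")
    n=${#starts[@]}
    # Step 1: walk the messages from the bottom of the file to the top.
    for (( i = n - 1; i >= 0; i-- )); do
        start=${starts[i]}
        if (( i + 1 < n )); then end=${starts[i + 1]}; else end=$(( size )); fi
        # Look only at this message's headers (everything up to the first blank line).
        if tail -c "+$(( start + 1 ))" "$mbox" | head -c "$(( end - start ))" |
               sed '/^$/q' | grep -Eq -- "$pattern"; then
            # Steps 2a and 3: skip the match, copy the rest untouched, and stop.
            head -c "$start" "$mbox"
            tail -c "+$(( end + 1 ))" "$mbox"
            return 0
        fi
    done
    return 1
}
```

A caller would redirect the output to a new file and move it over the original only if the function returned 0.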
Cheers,
-- Sam Bisbee
Ben Okopnik [ben at linuxgazette.net]
On Tue, Feb 16, 2010 at 02:57:30PM -0500, Samuel Bisbee-vonKaufmann wrote:
> On Mon, Feb 15, 2010 at 07:21:47PM -0500, Ben Okopnik wrote:
> > On Fri, Feb 12, 2010 at 11:48:28PM -0500, Samuel Bisbee-vonKaufmann wrote:
> > >
> > > Yeah, coding against "keep everything on one gigantic file" isn't very fun,
> > > though it makes administration a lot easier.
> >
> > But it makes processing a lot slower - and it's an asymptotic curve.
>
> Thanks for re-stating the first half of my point, but with fancier jargon. :-P
Aw, hell, Sam - I figgered after you went to thet there fancy Bawston collitch, you fergot how to talk lahk folks, so I copied some stuff outen a book fer ya... great-great-grandpaw got it from a big city liberry jest so all we'uns could see what one looked lahk. Hit's got some uh thet fancy talk in it, so I reckoned hit would suit ya.
(By the way, that there 'puter you sent us shore was complicated. I took hit all apart and couldn't find neether a carbureter nor a pull-cord nowheres, so I give up on it. Dern it, guess I'll hev to mow the hay by hand agin...)
> Indexes and databases are great, but they'll quickly increase the complexity of
> my program.
[blink] You and I must mean different things by 'database', then.
#!/bin/bash
# Created by Ben Okopnik on Tue Feb 16 19:58:51 EST 2010

[ -z "$1" ] && { printf "Usage: ${0##*/} <hdr_name> <hdr_val> [name val] ... ...\n"; exit 1; }
[ "$(($# % 2))" -ne 0 ] && { printf "# of headers != # of values.\n"; exit 1; }

sql='delete from emails where'
while [ "$#" -ne 0 ]
do
    sql="$sql $1 = '$2'"
    [ "$#" -gt 2 ] && sql="$sql and"
    shift; shift
done
echo "$sql"|/usr/bin/mysql -u user dbname
(You did say you wanted to stick with the shell, right?)
This would be the entire deletion program. It would also beat anything that parsed the file on every pass, speedwise.
> Also, adding dependencies will make it harder to port this
> software; yes, things like [database/caching/index software of choice] are often
> available, but I have dealt with a lot of hosting environments where it's a
> political process to get anything new added.
That's exactly why I mentioned indexing as another option: it requires no other software. You can generate an index file with Mail::MboxParser, then specify its name in the 'new' method - and you're done.
> > Wouldn't be a problem. The nuclear-powered mechanical dwarves beneath
> > the surface of this module will do the right thing if you only ask them.
>
> Hope they aren't unionized!
They'd miss their ions terribly if that ever happened... ))
-- * Ben Okopnik * Editor-in-Chief, Linux Gazette * http://LinuxGazette.NET *
Sam Bisbee [sbisbee at computervip.com]
On Tue, Feb 16, 2010 at 08:32:11PM -0500, Ben Okopnik wrote:
> On Tue, Feb 16, 2010 at 02:57:30PM -0500, Samuel Bisbee-vonKaufmann wrote:
> > On Mon, Feb 15, 2010 at 07:21:47PM -0500, Ben Okopnik wrote:
> > > On Fri, Feb 12, 2010 at 11:48:28PM -0500, Samuel Bisbee-vonKaufmann wrote:
[snip]
> > Indexes and databases are great, but they'll quickly increase the complexity of
> > my program.
>
> [blink] You and I must mean different things by 'database', then.
[blink blink]
> #!/bin/bash
> # Created by Ben Okopnik on Tue Feb 16 19:58:51 EST 2010
>
> [ -z "$1" ] && { printf "Usage: ${0##*/} <hdr_name> <hdr_val> [name val] ... ...\n"; exit 1; }
> [ "$(($# % 2))" -ne 0 ] && { printf "# of headers != # of values.\n"; exit 1; }
>
> sql='delete from emails where'
> while [ "$#" -ne 0 ]
> do
>     sql="$sql $1 = '$2'"
>     [ "$#" -gt 2 ] && sql="$sql and"
>     shift; shift
> done
> echo "$sql"|/usr/bin/mysql -u user dbname
[snip]
> This would be the entire deletion program.
Uh, no? You still need to read the mbox into the database and clean the input for SQL injection (or whatever).
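[ For readers wondering what "cleaning the input" might look like while staying in the shell, here is a minimal sketch. It is an assumption-laden illustration, not part of either correspondent's code: `build_delete_sql` is a made-up name, the `emails` table comes from the script quoted above, the rule that column names must be plain identifiers is my own, and the escaping (doubling single quotes and backslashes) assumes MySQL's default handling of string literals. Where a client library with bound parameters is available, that is generally the safer route. ]

```shell
#!/bin/bash
# Hypothetical sketch: print a DELETE statement for name/value pairs, refusing
# odd-looking column names and escaping the values before they are quoted.
build_delete_sql() {
    local sql='DELETE FROM emails WHERE' name value
    (( $# > 0 && $# % 2 == 0 )) || { echo 'need name/value pairs' >&2; return 1; }
    while (( $# )); do
        name=$1 value=$2
        # Column names cannot be quoted as data, so only accept plain identifiers.
        [[ $name =~ ^[A-Za-z_][A-Za-z0-9_]*$ ]] ||
            { printf 'bad column name: %s\n' "$name" >&2; return 1; }
        value=${value//\\/\\\\}     # double each backslash
        value=${value//\'/\'\'}     # double each single quote
        sql+=" $name = '$value'"
        (( $# > 2 )) && sql+=' AND'
        shift 2
    done
    printf '%s;\n' "$sql"
}
```

The output could then be piped to the `mysql` client exactly as in the quoted script.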
> It would also beat anything that parsed the file on every pass, speedwise.
That statement assumes a high number of emails, which is fine but should be noted. Ex., an mbox with one email is going to be more quickly dealt with by the iterative method I've been using than reading it into the db, building the index, and then building and running the deletion query.
I'm bouncing back and forth on how to handle the large mbox file user (like myself): build in support for a database or two, or tell them to simply filter their mailp emails into a different mbox file. I'll most likely punt on the issue, fighting it with the README for now, as I keep coming back to Rob Pike's third rule: "Fancy algorithms are slow when n is small, and n is usually small. Fancy algorithms have big constants. Until you know that n is frequently going to be big, don't get fancy. (Even if n does get big, use Rule 2 first.)" http://www.faqs.org/docs/artu/ch01s06.html
> > Also, adding dependencies will make it harder to port this software; yes,
> > things like [database/caching/index software of choice] are often available,
> > but I have dealt with a lot of hosting environments where it's a political
> > process to get anything new added.
>
> That's exactly why I mentioned indexing as another option: it requires no
> other software. You can generate an index file with Mail::MboxParser, then
> specify its name in the 'new' method - and you're done.
I played around with this to no avail. It took Mail::MboxParser longer to build the index than it did for the iterative method to just run like normal (running against 371M mbox).
> > > Wouldn't be a problem. The nuclear-powered mechanical dwarves beneath the
> > > surface of this module will do the right thing if you only ask them.
> >
> > Hope they aren't unionized!
>
> They'd miss their ions terribly if that ever happened... ))
[Insert joke about socialism, cows, and capitalists here.]
Cheers,
-- Sam Bisbee
Ben Okopnik [ben at linuxgazette.net]
On Sun, Feb 21, 2010 at 07:07:14PM -0500, Samuel Bisbee-vonKaufmann wrote:
> On Tue, Feb 16, 2010 at 08:32:11PM -0500, Ben Okopnik wrote:[snip]
> > This would be the entire deletion program.
>
> Uh, no? You still need to read the mbox into the database and clean the
> input for SQL injection (or whatever).
Sam, that's a red herring. Notice how I said "deletion program", not "parsing program"? In addition, it's ridiculous to talk about SQL injection when you're giving a deletion tool to your users. Why would they care about SQL injection when they can just delete the entire box via the normal, defined use of your program?
> > It would also beat anything that parsed the file on every pass, speedwise.
>
> That statement assumes a high number of emails, which is fine but should be
> noted. Ex., an mbox with one email is going to be more quickly dealt with by
> the iterative method I've been using than reading it into the db, building the
> index, and then building and running the deletion query.
You don't build an index with a DB, Sam - you just load up the data - and you don't get to count "building and running the deletion query" against this approach, since you have to do those things with any approach.
> I'm bouncing back and forth on how to handle the large mbox file user (like
> myself): build in support for a database or two, or tell them to simply filter
> their mailp emails into a different mbox file. I'll most likely punt on the
> issue, fighting it with the README for now, as I keep coming back to Rob Pike's
> third rule: "Fancy algorithms are slow when n is small, and n is usually small.
> Fancy algorithms have big constants. Until you know that n is frequently going
> to be big, don't get fancy. (Even if n does get big, use Rule 2 first.)"
> http://www.faqs.org/docs/artu/ch01s06.html
Half a dozen lines of shell scripting is a fancy algorithm? Wow. You should look at some bio-informatics or mathematical puzzle algorithms, just to clear your mind. Oh, and to address your question about loading the emails into a DB: that's no more difficult than putting them into your mailbox in the first place. Tell procmail to pipe the appropriate emails to a script; have the script split the email into all the headers and the body ('formail' will do that quite handily), and feed them to the DB as an INSERT statement. The solution is trivial, and is left up to the individual student.
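[ As a rough idea of the loading step described above, here is a minimal sketch of a script body that procmail could pipe a single message to. Everything specific in it is an assumption for illustration: the function names, the `emails` table with `to_addr`, `x_mailp` and `body` columns, and the choice of pure Bash instead of 'formail' for pulling out the headers. It ignores folded (multi-line) headers and escapes values by doubling single quotes and backslashes, which assumes MySQL's default string-literal rules. ]

```shell
#!/bin/bash
# Hypothetical helper: escape a value for use inside a single-quoted SQL literal.
sql_escape() {
    local s=$1
    s=${s//\\/\\\\}
    s=${s//\'/\'\'}
    printf '%s' "$s"
}

# Hypothetical sketch: read one message on STDIN and print an INSERT statement
# for an assumed table: emails(to_addr, x_mailp, body).
message_to_insert() {
    local line to='' xmailp='' body='' in_headers=1
    while IFS= read -r line || [ -n "$line" ]; do
        if (( in_headers )); then
            case ${line,,} in                 # header names are case-insensitive
                '')        in_headers=0 ;;    # blank line: the body starts here
                to:*)      to=${line#*:} ;;
                x-mailp:*) xmailp=${line#*:} ;;
            esac
        else
            body+=$line$'\n'
        fi
    done
    # Trim the whitespace that follows the colon in each header.
    to=${to#"${to%%[![:space:]]*}"}
    xmailp=${xmailp#"${xmailp%%[![:space:]]*}"}
    printf "INSERT INTO emails (to_addr, x_mailp, body) VALUES ('%s', '%s', '%s');\n" \
        "$(sql_escape "$to")" "$(sql_escape "$xmailp")" "$(sql_escape "$body")"
}
```

The resulting statement would be fed to the database client in the same way as the deletion query.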
> > > > Wouldn't be a problem. The nuclear-powered mechanical dwarves beneath the
> > > > surface of this module will do the right thing if you only ask them.
> > >
> > > Hope they aren't unionized!
> >
> > They'd miss their ions terribly if that ever happened... ))
>
> [Insert joke about socialism, cows, and capitalists here.]
Yes, but it only works with spherical cows.
-- * Ben Okopnik * Editor-in-Chief, Linux Gazette * http://LinuxGazette.NET *