EmailDiscussions.com

11 Apr 2003 01:31 AM

eml2mbox.sh
 
Howdy mbox lovers!

After a relatively brief argument/discussion with Rob I decided to write an eml to mbox conversion utility in shell, as some kind of expression of my masochistic tendencies; and because Rob wouldn't do it (as part of the Fastmail system)---although he has been extremely helpful in assisting me with ideas and code. Not least of which is his suggestion of doing this in Perl, which probably would be about a third as many lines of code or less!

As I have only just written this over the last few days it hasn't been tested extensively and could easily do with a few speedups---at the expense of clarity perhaps. Yet for its immaturity, it is pretty functional and robust. It is not RFC-822 perfect in at least one respect, but one of my MTAs is reasonably happy with the result.

Apart from it not being perfectly RFC-822 groovy, it is a little slow as it will only process about 4-5MB per minute on my PIII/667.

Suggestions welcome.


thoran

P.S. One might normally find me as ID thoran, but I can't remember my password.

P.P.S. Expect a perl version some time in the next few weeks. In the meantime you'll just have to wait an extra minute or two per folder/mailbox conversion.

P.P.P.S. I didn't realise that there was a Ruby version until just now. I saw the more recent discussion, but not the one from a couple of months ago until I just did a search for the more recent one. I might have used the Ruby one otherwise. Oh well, here's a more ubiquitous one anyhow.

P.P.P.P.S. The script follows because otherwise this message is too long.

11 Apr 2003 01:32 AM

#!/bin/sh
# eml2mbox
# by thoran@thoran.com

# Date: 20030410
# Version Number:
Version='0/4/4'

# Description: eml2mbox takes a directory of retarded eml/Windows formatted email files (such as those from Fastmail) and writes an mbox (RFC-822) file for use with unix. The conversion may be destructive or non-destructive and the mbox can be placed elsewhere than the eml files directory.

# Features: This does fairly comprehensive error checking: 'Does the source directory exist?', 'Does the destination directory exist?', 'Are there any .eml files in the source directory?', and 'Is there a pre-existing mbox file in the destination directory?'. Additionally it will engage in a dialogue so as to authorise an overwrite of the existent mbox. It is robust enough that if there are no Date or From headers in an email it will be able to continue.

# Discussion: Yes, I am aware that the plurality of the program's name is incorrect. I thought it sounded better wrong. Variable substitution is circumlocutious, however the code is easier to read as the status of a variable (strictly a collection of variables) can be tracked by name changes. Similarly, the checkDirectoryExists function attempts readability, particularly for those not familiar with shell, but with another procedural language. And I know getopts exists, but it doesn't work for me for some reason.

# Acknowledgements: Rob Mueller of Fastmail for the original couple of lines (AKA the non-anally retentive version), a nice reference to RFC-822 formatting (http://www.qmail.org/man/man5/mbox.html), and for not having Fastmail just do this anyway! Thanks also to Era Eriksson <era@iki.fi> for the majority of the sed code which does the extraction of emails from the 'From:' header and for a nice explication thereof.

# Bugs: If a date header does not have a leading 0 for single digit dates (1st, 2nd, ..., 9th of a month), then the From_ separator is not strictly correct at 24 characters. Most MTAs do provide dates with the leading zero. The fix is to check whether that particular (in awkish terms) field is a single character and if so to prefix it with a zero.

# Licence: Umm a licence, a licence, so many to choose. Well I suppose I ought to start with copyright of this, any other previous versions which I didn't put this notice on directly, and any future versions in case I start to lose my brain. Then I should move on to copyleft so as to be balanced. I think that says that notices must be left intact. Well, anyway, every line with a # at the start has to be left in. Any modifications have to be sent to me---which I think is consistent with copyleft, rather than merely publishing as copyleft. Let me know if this is wrong. Either way, I still want revisions...

# The Usual Legal Stuff About Disclaimers Etcetera (AKA The Non-Caveat-Emptor (The Non-Buyer Beware)): While this script may claim to do something, it's a lie, so if you attempt to use it for its stated purpose and it does what you expect then you're doing well. If however, this script for any reason ****s up or ****s something up, or does absolutely nothing for you or anyone or anything else, for any reason, or you are otherwise unsatisfied, upset, annoyed, ****** (off), angry, litigious, vengeful, bummed, or just generally displeased with it and/or me, then it's your fault for using it. Furthermore, it is a precondition of use of this script that you have read it and understand how it works, so as you can determine if it will do what you want, and not merely what my lies about its function might claim. (There is nothing extraordinary about this disclaimer, except for the bit about reading the source maybe. Every software vendor has something very much like the above. They just take a lot more words and still don't make it quite so plain.)

showVersion()
{
echo
echo " eml2mbox $Version."
echo
}

showHelp()
{
echo
echo 'eml2mbox converts a directory of eml files to an mbox. Source and'
echo 'destination directories default to the current directory.'
echo
echo 'Simple usage:'
echo 'eml2mbox'
echo 'eml2mbox -s <path to source directory---the one containing the eml files>'
echo 'eml2mbox -d <path to destination directory---where the mbox is going to>'
echo 'eml2mbox -r <remove all .eml files from the source directory>'
echo
echo 'Usage with all options:'
echo 'eml2mbox ['
echo ' [ -s | --source ]'
echo ' [ -d | --destination ]'
echo ' [ -r | --remove ]'
echo ' |'
echo ' [ -? | --help ]'
echo ' |'
echo ' [ -V | --version ]'
echo ' ].'
echo
}

parseParameters()
{
sourceDirectory='.'
destinationDirectory='.'
destructiveMode=0
currentOption='NULL'

for currentParameter in $allParameters
do
case $currentOption in
'NULL')
case $currentParameter in
'-s' | '--source')
currentOption='s'
;;
'-d' | '--destination')
currentOption='d'
;;
'-r' | '--remove')
destructiveMode=1
currentOption='NULL'
;;
'-?' | '--help')
showHelp
exit 0
;;
'-V' | '--version')
showVersion
exit 0
;;
*)
echo "Unknown option ($currentParameter). See eml2mbox --help..."
echo
exit 1
;;
esac
;;
's')
sourceDirectory=$currentParameter
currentOption='NULL'
;;
'd')
destinationDirectory=$currentParameter
currentOption='NULL'
;;
esac
done
}

checkDirectoryExists()
{
directoryExists=0
parameterOne=$1
directoryToBeChecked=$parameterOne
if [ -d $directoryToBeChecked ]; then
directoryExists=1
fi
}

ensureDirectoryPathHasATrailingSlash()
{
currentDirectory=`pwd`
pathToPossiblySlashlessDirectory=$1
cd $pathToPossiblySlashlessDirectory
pathToDirectoryWithTrailingSlash=`pwd`'/'
cd $currentDirectory
}

verifyParameters()
{
parametersOK=0

if [ $sourceDirectory = '.' ]; then
sourceDirectoryExists=1
else
checkDirectoryExists $sourceDirectory
if [ $directoryExists = 1 ]; then
sourceDirectoryExists=1
else
sourceDirectoryExists=0
echo 'The source directory does not exist.'
fi
fi

if [ $destinationDirectory = '.' ]; then
destinationDirectoryExists=1
else
checkDirectoryExists $destinationDirectory
if [ $directoryExists = 1 ]; then
destinationDirectoryExists=1
else
destinationDirectoryExists=0
echo 'The destination directory does not exist.'
fi
fi

if [ $sourceDirectoryExists = 1 ]; then
if [ -e "*.eml" ]; then
emlFilesFound=0
echo 'The source directory contains no .eml files.'
else
emlFilesFound=1
fi
fi

ensureDirectoryPathHasATrailingSlash $destinationDirectory
destinationDirectory=$pathToDirectoryWithTrailingSlash
if [ $destinationDirectoryExists = 1 ]; then
path=$destinationDirectory'mbox'
mboxFile=`ls $path`
if [ -z "$mboxFile" ]; then
mboxOK=1
else
echo "This destination directory already contains an mbox file:"
echo "$destinationDirectory."
echo -n 'Do you wish to overwrite the mbox file in this directory? (y,n): '
read dialogueResponse
echo
if [ "$dialogueResponse" = 'y' ]; then
mboxOK=1
else
mboxOK=0
fi
fi
fi

[ $sourceDirectoryExists = 1 ] &&
[ $destinationDirectoryExists = 1 ] &&
[ $emlFilesFound = 1 ] &&
[ $mboxOK = 1 ] &&
parametersOK=1
}

doIt()
{
processID=$$
ensureDirectoryPathHasATrailingSlash $sourceDirectory
sourceDirectory=$pathToDirectoryWithTrailingSlash
sourceFiles=$sourceDirectory'*.eml'

for currentEmail in $sourceFiles
do
from='fake.address@dotbomb.com'
date='Mon Jan 1 00:00:00 9999'
fromFound=0
dateFound=0
until [ $fromFound = 1 ] && [ $dateFound = 1 ]
do
read line
if [ "$line" = '' ]; then
break
else
firstWord=`echo $line | awk '{print $1}'`
if [ $firstWord = 'From:' ]; then
from=`echo $line | sed -e 's/From: //' -e 's/[ ]*([^)]*)[ ]*//g' -e 's/.*<\([^>]*\)>.*/\1/g'`
fromFound=1
elif [ $firstWord = 'Date:' ]; then
date=`echo $line | sed 's/,//' | awk '{printf("%s %s %s %s %s", $2, $4, $3, $6, $5)}'`
dateFound=1
fi
fi
done < "$currentEmail"
echo "From $from $date" >> $destinationDirectory'mbox.tmp'.$processID
cat "$currentEmail" >> $destinationDirectory'mbox.tmp'.$processID
done

if [ -e "$destinationDirectory"'mbox' ]; then
rm $destinationDirectory'mbox'
mv $destinationDirectory'mbox.tmp'.$processID $destinationDirectory'mbox'
else
mv $destinationDirectory'mbox.tmp'.$processID $destinationDirectory'mbox'
fi
if [ $destructiveMode = 1 ]; then
rm $sourceFiles
fi
}

main()
{
parseParameters
verifyParameters
if [ $parametersOK = 1 ]; then
doIt
else
echo 'No changes performed.'
fi
}

allParameters=$@
main
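
On the getopts remark in the Discussion comment above: what follows is only a minimal sketch of how the short options might be parsed with the POSIX getopts builtin. The function name parseWithGetopts and the variable showVersionRequested are hypothetical and not part of eml2mbox. getopts does not handle long options such as --source, and -? is left out here because getopts itself uses ? to report an unrecognised option.

```shell
# Hypothetical sketch: short-option parsing with getopts.
# Variable names follow the script above; parseWithGetopts itself is
# an illustration and not part of eml2mbox.
parseWithGetopts()
{
        sourceDirectory='.'
        destinationDirectory='.'
        destructiveMode=0
        showVersionRequested=0
        OPTIND=1        # restart parsing at the function's first argument

        while getopts 's:d:rV' currentOption
        do
                case $currentOption in
                        s) sourceDirectory=$OPTARG ;;
                        d) destinationDirectory=$OPTARG ;;
                        r) destructiveMode=1 ;;
                        V) showVersionRequested=1 ;;
                        *) return 1 ;;  # getopts has already reported the bad option
                esac
        done
        return 0
}

# Passing "$@" quoted keeps paths that contain spaces in one piece.
parseWithGetopts "$@"
```

Called as parseWithGetopts -s '/tmp/my mail' -r, this should leave sourceDirectory holding the whole path, which the "for currentParameter in $allParameters" loop would split at the space.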

11 Apr 2003 01:52 AM

Oops!
 
Hello Again,

All the indents have been lost. I'm going to have to do something else by way of posting it here or doing something elsewhere. If someone wants to host this for me then please let me know.


thoran

P.S. I just realised that the destructive mode is crapping out, so here's a quick fix from the next release (just replace this section of code in the previous post):

if [ $destructiveMode = 1 ]; then
currentDirectory=`pwd`
cd $sourceDirectory
rm *.eml
cd $currentDirectory
fi
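
An alternative to the quick fix above, offered only as a sketch: removing the files one at a time means rm never receives a long argument list, however many .eml files there are. The function name removeEmlFiles is hypothetical and not part of eml2mbox; it assumes the directory path is passed without a trailing slash.

```shell
# Hypothetical sketch: remove .eml files one by one so that the
# argument list handed to rm stays short.
removeEmlFiles()
{
        for emlFile in "$1"/*.eml
        do
                # If nothing matches, the pattern is left as-is and is
                # skipped here because it is not an existing file.
                if [ -f "$emlFile" ]; then
                        rm -- "$emlFile"
                fi
        done
        return 0
}
```

This is slower than a single rm, since rm is started once per file, but it does not depend on changing directory first.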

ady 11 Apr 2003 02:46 AM

I think if you put them inside the code tag, the indents will not be lost.

Something like this:
Code:

function something()
{
    do this;
    do that;
}


sjk 11 Apr 2003 06:59 AM

Re: eml2mbox.sh
 
Quote:

Originally posted by thoran2
P.P.P.S. I didn't realise that there was a Ruby version until just now.

Yep: Ruby script eml files -> unix mbox available.

11 Apr 2003 08:28 AM

eml2mbox Release 1, Part 1
 
Code:

#!/bin/sh
# eml2mbox
# by thoran@thoran.com

# Date: 20030411
# Version Number:
Version='1/0/8'

# Description: eml2mbox takes a directory of retarded eml/Windows formatted email files (such as those from Fastmail) and writes an mbox (RFC-822) file for use with unix.  The conversion may be destructive or non-destructive and the mbox can be placed elsewhere than the eml files directory. 

# Features: This does fairly comprehensive error checking: 'Does the source directory exist?', 'Does the destination directory exist?', 'Are there any .eml files in the source directory?', and 'Is there a pre-existing mbox file in the destination directory?'.  Additionally, it will engage in a dialogue so as to authorise an overwrite of an existent mbox.  It is robust enough that if there are no Date or From headers in an email it will be able to continue. 

# Discussion: Yes, I am aware that the plurality of the program's name is incorrect.  I thought it sounded better wrong.  Variable substitution is circumlocutious, however the code is easier to read as the status of a variable (strictly a collection of variables) can be tracked by name changes.  Similarly, the checkDirectoryExists function attempts readability, particularly for those not familiar with shell, but with another procedural language at least, perhaps.  And I know getopts exists, but it doesn't work for me for some reason. 

# Acknowledgements: Rob Mueller of Fastmail for the original couple of lines (AKA the non-anally retentive version), a nice reference to RFC-822 formatting (http://www.qmail.org/man/man5/mbox.html), and for not having Fastmail just do this anyway!  Thanks also to Era Eriksson <era@iki.fi> for the majority of the sed code which does the extraction of email addresses from the 'From:' header and for a nice explication thereof. 

# Bugs: If a date header does not have a leading 0 for single digit dates (1st, 2nd, ..., 9th of a month), then the From_ separator is not strictly correct at 24 characters.  Most MUAs do provide dates with a leading zero where necessary.  The fix is to check whether that particular (in awkish terms) field is a single character and if so to prefix it with a zero. 

# Bugs Fixed Since The Last Release: 1. The rm command was overwhelmed by too many files to remove when the full path for every file was specified; so instead it cds to the source directory and issues an rm *.eml.  2. The logic of the .eml file test in the verifyParameters function was arse-about and wouldn't have worked anyway even once the logic alone was corrected; I think I was attempting to modify it to or from the similar test below in verifyParameters for the mbox in the destination directory.  The two tests are far more congruent now and both work and work similarly.  I missed this because the reversed logic hid the failing test, and because I had an mbox present after beginning testing I didn't notice until I started with a fresh (no mbox) directory today. 

# Licence: Umm a licence, a licence, so many to choose.  Well I suppose I ought to start with copyright of this, any other previous versions which I didn't put this notice on directly, and any future versions in case I start to lose my brain.  Then I should move on to copyleft so as to be balanced.  I think that says that notices must be left intact.  Well, anyway, every line with a # at the start has to be left in.  Any modifications have to be sent to me---which I think is consistent with copyleft, rather than merely publishing as copyleft.  Let me know if this is wrong.  Either way, I still want revisions... 

# The Usual Legal Stuff About Disclaimers Etcetera (AKA The Non-Caveat-Emptor (The Non-Buyer Beware)): While this script may claim to do something, it's a lie, so if you attempt to use it for its stated purpose and it does what you expect then you're doing well.  If however, this script for any reason ****s up or ****s something up, or does absolutely nothing for you or anyone or anything else, for any reason, or you are otherwise unsatisfied, upset, annoyed, ****** (off), angry, litigious, vengeful, bummed, or just generally displeased with it and/or me, then it's your fault for using it.  Furthermore, it is a precondition of use of this script that you have read it and understand how it works, so as you can determine if it will do what you want, and not merely what my lies about its function might claim.  (There is nothing extraordinary about this disclaimer, except for the bit about reading the source maybe.  Every software vendor has something very much like the above.  They just take a lot more words and still don't make it quite so plain.) 

showVersion()
{
        echo
        echo "    eml2mbox $Version."
        echo
}

showHelp()
{
        echo
        echo 'eml2mbox converts a directory of eml files to an mbox.  Source and'
        echo 'destination directories default to the current directory.'
        echo
        echo 'Simple usage:'
        echo 'eml2mbox'
        echo 'eml2mbox -s <path to source directory---the one containing the eml files>'
        echo 'eml2mbox -d <path to destination directory---where the mbox is going to>'
        echo 'eml2mbox -r <remove all .eml files from the source directory>'
        echo
        echo 'Usage with all options:'
        echo 'eml2mbox        ['
        echo '                          [ -s | --source ]'
        echo '                          [ -d | --destination ]'
        echo '                                        [ -r | --remove ]'
        echo '                                |'
        echo '                                        [ -? | --help ]'
        echo '                                |'
        echo '                          [ -V | --version ]'
        echo '                        ].'
        echo
}

parseParameters()
{
        sourceDirectory='.'
        destinationDirectory='.'
        destructiveMode=0
        currentOption='NULL'
       
        for currentParameter in $allParameters
        do
                case $currentOption in
                        'NULL')
                                case $currentParameter in
                                        '-s' | '--source')
                                                currentOption='s'
                                                ;;
                                        '-d' | '--destination')
                                                currentOption='d'
                                                ;;
                                        '-r' | '--remove')
                                                destructiveMode=1
                                                currentOption='NULL'
                                                ;;
                                        '-?' | '--help')
                                                showHelp
                                                exit 0
                                                ;;
                                        '-V' | '--version')
                                                showVersion
                                                exit 0
                                                ;;
                                        *)
                                                echo "Unknown option ($currentParameter).  See eml2mbox --help..."
                                                echo
                                                exit 1
                                                ;;
                                esac
                                ;;
                        's')
                                sourceDirectory=$currentParameter
                                currentOption='NULL'
                                ;;
                        'd')
                                destinationDirectory=$currentParameter
                                currentOption='NULL'
                                ;;
                esac
        done
}

checkDirectoryExists()
{
        directoryExists=0
        parameterOne=$1
        directoryToBeChecked=$parameterOne
        if [ -d $directoryToBeChecked ]; then
                directoryExists=1
        fi
}

ensureDirectoryPathHasATrailingSlash()
{
        currentDirectory=`pwd`
        pathToPossiblySlashlessDirectory=$1
        cd $pathToPossiblySlashlessDirectory
        pathToDirectoryWithTrailingSlash=`pwd`'/'
        cd $currentDirectory
}

verifyParameters()
{
        parametersOK=0
       
        if [ $sourceDirectory = '.' ]; then
                sourceDirectoryExists=1
        else
                checkDirectoryExists $sourceDirectory
                if [ $directoryExists = 1 ]; then
                        sourceDirectoryExists=1
                else
                        sourceDirectoryExists=0
                        echo 'The source directory does not exist.'
                fi
        fi
       
        if [ $destinationDirectory = '.' ]; then
                destinationDirectoryExists=1
        else
                checkDirectoryExists $destinationDirectory
                if [ $directoryExists = 1 ]; then
                        destinationDirectoryExists=1
                else
                        destinationDirectoryExists=0
                        echo 'The destination directory does not exist.'
                fi
        fi
       
        ensureDirectoryPathHasATrailingSlash $sourceDirectory
        sourceDirectory=$pathToDirectoryWithTrailingSlash
        if [ $sourceDirectoryExists = 1 ]; then
                path=$sourceDirectory'*.eml'
                emlFiles=`ls $path`
                if [ -z "$emlFiles" ]; then
                        emlFilesFound=0
                        echo 'The source directory contains no .eml files.'
                else
                        emlFilesFound=1
                fi
        fi
       
        ensureDirectoryPathHasATrailingSlash $destinationDirectory
        destinationDirectory=$pathToDirectoryWithTrailingSlash
        if [ $destinationDirectoryExists = 1 ]; then
                path=$destinationDirectory'mbox'
                mboxFile=`ls $path`
                if [ -z "$mboxFile" ]; then
                        mboxOK=1
                else
                        echo "This destination directory already contains an mbox file:"
                        echo "$destinationDirectory."
                        echo -n 'Do you wish to overwrite the mbox file in this directory? (y,n): '
                        read dialogueResponse
                        echo
                        if [ "$dialogueResponse" = 'y' ]; then
                                mboxOK=1
                        else
                                mboxOK=0
                        fi
                fi
        fi

        [ $sourceDirectoryExists = 1 ] &&
        [ $destinationDirectoryExists = 1 ] &&
        [ $emlFilesFound = 1 ] &&
        [ $mboxOK = 1 ] &&
        parametersOK=1
}

doIt()
{
        processID=$$
        ensureDirectoryPathHasATrailingSlash $sourceDirectory
        sourceDirectory=$pathToDirectoryWithTrailingSlash
        sourceFiles=$sourceDirectory'*.eml'

        for currentEmail in $sourceFiles
        do
                from='fake.address@dotbomb.com'
                date='Mon Jan 1 00:00:00 9999'
                fromFound=0
                dateFound=0
                until [ $fromFound = 1 ] && [ $dateFound = 1 ]
                do
                        read line
                        if [ "$line" = '' ]; then
                                break
                        else
                                firstWord=`echo $line | awk '{print $1}'`
                                if [ $firstWord = 'From:' ]; then
                                        from=`echo $line | sed  -e 's/From: //' -e 's/[        ]*([^)]*)[        ]*//g' -e 's/.*<\([^>]*\)>.*/\1/g'`
                                        fromFound=1
                                elif [ $firstWord = 'Date:' ]; then
                                        date=`echo $line | sed 's/,//' | awk '{printf("%s %s %s %s %s", $2, $4, $3, $6, $5)}'`
                                        dateFound=1
                                fi
                        fi
                done < "$currentEmail"
                echo "From $from $date" >> $destinationDirectory'mbox.tmp'.$processID
                cat "$currentEmail" >> $destinationDirectory'mbox.tmp'.$processID
        done

        if [ -e "$destinationDirectory"'mbox' ]; then
                rm $destinationDirectory'mbox'
                mv $destinationDirectory'mbox.tmp'.$processID $destinationDirectory'mbox'
        else
                mv $destinationDirectory'mbox.tmp'.$processID $destinationDirectory'mbox'
        fi
        if [ $destructiveMode = 1 ]; then
                currentDirectory=`pwd`
                cd $sourceDirectory
                rm *.eml
                cd $currentDirectory
        fi
}


11 Apr 2003 08:30 AM

eml2mbox Release 1, Part 2/2
 
Code:


main()
{
        parseParameters
        verifyParameters
        if [ $parametersOK = 1 ]; then
                doIt
        else
                echo 'No changes performed.'
        fi
}

allParameters=$@
main
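
The fix described in the Bugs comment (prefix a single-character day field with a zero) might look something like the sketch below. The function name formatMboxDate is hypothetical and not part of eml2mbox; the field order is the one already used by the awk call in doIt, and the header is assumed to have the usual "Date: Day, DD Mon YYYY HH:MM:SS zone" layout.

```shell
# Hypothetical sketch: build the date for the From_ line from a
# "Date:" header, zero-padding a single-digit day of the month.
formatMboxDate()
{
        echo "$1" | sed 's/,//' | awk '{
                day = $3
                if (length(day) == 1) day = "0" day
                printf("%s %s %s %s %s\n", $2, $4, day, $6, $5)
        }'
}

formatMboxDate 'Date: Fri, 4 Apr 2003 01:31:00 +0900'
# prints: Fri Apr 04 01:31:00 2003
```

The result is 24 characters whether the day has one digit or two.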


Jeremy Howard 11 Apr 2003 11:49 AM

Great! A couple of pointers for those interested...:

For a super simple version, but which requires procmail and doesn't get dates quite right, use Sjk's one-liner:
Code:

for m in *.eml; do formail < "$m" >> archive.mbox; done

For those interested in running either the shell script version or the procmail version on Windows, you'll need the super-cool Cygwin (which has numerous shells, and procmail): http://sources.redhat.com/cygwin/
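
Whichever converter is used, a rough sanity check is to count the From_ separator lines in the result and compare that with the number of .eml files. This is only a sketch: countMboxMessages is a hypothetical helper, and the count will be too high if a message body contains an unescaped line beginning with "From ".

```shell
# Hypothetical sketch: count the lines that start a message in an mbox.
countMboxMessages()
{
        # grep -c prints the number of matching lines (0 if none).
        grep -c '^From ' "$1"
}

# Example: countMboxMessages archive.mbox
```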

thoran 12 Apr 2003 11:17 PM

The Script Is ****ed Up & Revised Speed Estimates
 
Hello,

The script will not work because it is being truncated. For instance the sed code for extracting email addresses is not even half there. I suggest that you email me if you want a working copy, at least until such time as I can post a link...

It looks as if I either got it wrong or Release 1 is a hell of a lot faster than Release 0, but I ran this later release across a 45MB directory and it got through it in just under a minute and a half, so it does about 30MB per minute, not 4 or 5 as I had claimed before. So it ain't quite so slow after all...


thoran

P.S. Even if I did modify the line length, I am not inclined to post it here anymore anyway because of the prudish disposition with respect to ****in' naughty words.

P.P.S. Actually I think it is only the sed code which is truncated. Here it is then:
from=`echo $line | sed -e 's/From: //' -e 's/[ ]*([^)]*)[ ]*//g' -e 's/.*<\([^>]*\)>.*/\1/g'`
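
To see what the three sed expressions do, here is a small sketch that wraps the same command in a function and feeds it the two common header layouts. The function name extractAddress and the example.com addresses are made up for illustration.

```shell
# Hypothetical wrapper around the sed command quoted above.
extractAddress()
{
        # 1st expression: drop the "From: " prefix.
        # 2nd expression: drop any "(comment)" and the spaces around it.
        # 3rd expression: if an <address> is present, keep only its contents.
        echo "$1" | sed -e 's/From: //' -e 's/[ ]*([^)]*)[ ]*//g' -e 's/.*<\([^>]*\)>.*/\1/g'
}

extractAddress 'From: Rob Mueller <rob@example.com>'
extractAddress 'From: rob@example.com (Rob Mueller)'
# both lines print: rob@example.com
```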

thoran 16 Apr 2003 08:04 AM

Release 2 Available
 
Hello,

This version's major enhancement/bug fix is to do a complete rollback at any stage in the conversion process if it is interrupted, terminated, or exited. I like things robust, so it is.

One unfortunate, though minor, side-effect of this is that some of the exit condition trapping code gets executed several times causing repetitious display of messages. This one is especially frustrating---help?...
Otherwise there's the usual minor cleanups.
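
On the repeated messages: Release 2 is not posted here, so this is a guess at one possible cause and not a diagnosis. When a signal trap calls exit, the shell runs the EXIT trap as well, so a handler installed for both runs twice. A minimal sketch of a handler that guards against that follows; temporaryMbox, rollback and installRollbackTraps are all assumed names.

```shell
# Hypothetical sketch: a rollback handler that does its work only once.
# None of these names are taken from Release 2.
temporaryMbox=${temporaryMbox:-"./mbox.tmp.$$"}
rollbackDone=0

rollback()
{
        # Guard against a second run and clear the traps on the way in.
        if [ "$rollbackDone" = 1 ]; then
                return 0
        fi
        rollbackDone=1
        trap - EXIT INT TERM
        rm -f "$temporaryMbox"
        echo 'Conversion interrupted; temporary mbox removed.'
}

installRollbackTraps()
{
        trap rollback EXIT
        trap 'rollback; exit 1' INT TERM
}

# After the temporary file has been moved into place successfully,
# the traps would be cleared with: trap - EXIT INT TERM
```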

Email me for a copy since I don't wish to post it here anymore... It is too long to keep posting, I don't like **** and other lovely words being excised, it is too long to fit into a single post, and I can't think of any other reasons right now.

The next version will try to incorporate the use of standard in and out so as to allow it to be used as part of a chain of utilities. I may not do this since the input is a number of files, rather than a single stream, so piping is not obviously appropriate. I'd have to pipe in a list of files to be converted. Arguments for and against?
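
For the sake of argument, the "pipe in a list of files" idea could be sketched as below: file names come in on standard input, one per line, and the mbox goes to standard output. The function name convertListedEmails is hypothetical, and the From_ line uses only the script's placeholder sender and date where a real version would extract them from the headers.

```shell
# Hypothetical sketch: read file names from standard input and write
# the resulting mbox to standard output.
convertListedEmails()
{
        while IFS= read -r currentEmail
        do
                if [ -f "$currentEmail" ]; then
                        # Placeholder sender and date, as in the script's defaults.
                        echo 'From fake.address@dotbomb.com Mon Jan 1 00:00:00 9999'
                        cat "$currentEmail"
                fi
        done
}

# Example: ls /path/to/folder/*.eml | convertListedEmails > mbox
```

One argument for this shape is that the list can come from anything that prints file names, so selecting or ordering the messages is left to other utilities.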


thoran

P.S. Look out for fm2mbox.sh, which takes a collection of zipped mail archives and converts the lot to mboxes. It should be available in the next day or so. Note that fm2mbox relies upon eml2mbox to do each extracted folder conversion, and that it also requires having "unzip" available.

Jeremy Howard 16 Apr 2003 10:31 AM

Could you upload it to some free web space somewhere and link to it?

BTW, I'm sorry you're offended by Edwin's policy of ensuring the forums are "family friendly". I know he doesn't mean any offence by putting this in place - he's just trying to make it accessible to as many people as possible. I hope it won't stop you from continuing to post your interesting projects.


All times are GMT +9.


Copyright EmailDiscussions.com 1998-2022. All Rights Reserved.