Pulling Mail Out of Gmail And Retaining Your Labels

If you are fed up with Gmail and want to pull all your mail out, here is how to do it. I have used this technique on over 30 mail accounts, so I’m sure it will work for you.

Exporting your mail from Gmail is not a trivial problem. There are plenty of write-ups: the post by Opera Software’s lead QA for Opera Mail on Gmail’s Buggy IMAP Implementation, Matt Cutts’ How to back up your Gmail on Linux in four easy steps, LifeHacker’s Back up Gmail on Linux with Getmail, and Wired’s wiki entry Make a Local Backup Of Your Gmail Account. Even so, there seems to be no single definitive source on how to pull your mail and retain your labels.

So here is what I’ve done to solve this problem:

  1. Use getmail – this has been the best archiver I’ve run across. There are other applications – isync, OfflineIMAP, Fetchmail, etc. – that probably do a decent job, but getmail is still the best in my view. There are other hacks – use Mail.app to synch the Gmail IMAP directory, then convert emlx to maildir; same for Thunderbird and mbox; etc – but we wanted something a little more straightforward – Occam’s razor, right?
  2. Install getmail – On my dev machine, I used MacPorts (port install python25; port install getmail) to install the latest getmail, which depends on Python 2.5. Once that was done, I set up the getmailrc config file and fired off an attempt using SimpleIMAPSSLRetriever… which failed because the newly installed Python lacked SSL support. I had to go back and install Readline (port install py25-readline), then install SSL for Python (port install py25-socket-ssl).
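Before firing off getmail, it can save a failed run to check whether the Python it will use has SSL support at all. Here is a minimal sketch, assuming a current Python 3 interpreter (the post used Python 2.5, where the module layout differed); `python_has_imap_ssl` is a hypothetical helper name, not part of getmail:

```python
import imaplib


def python_has_imap_ssl():
    """Return True if this interpreter can open IMAP-over-SSL connections.

    imaplib only defines IMAP4_SSL when the ssl module imported successfully,
    so its presence is a reasonable proxy for "SSL is available".
    """
    return hasattr(imaplib, "IMAP4_SSL")


if __name__ == "__main__":
    print("IMAP over SSL available:", python_has_imap_ssl())
```

If this reports False, an SSL retriever will most likely fail in the same way described above.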
  3. Patch Python – There is a malloc bug in imaplib when fetching large documents using SSL. So open up imaplib.py from your Python lib dir (in my case /opt/local/lib/python2.5/) and replace:
    data = self.sslobj.read(size-read)

    with

    data = self.sslobj.read(min(size-read, 16384))

    so that each SSL read is capped at 16384 bytes (16 KB) rather than requesting the whole remaining message in one allocation.
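The idea behind the patch can be shown outside imaplib. This is only an illustrative sketch, not the imaplib source: `read_exact` is a hypothetical helper that pulls a fixed number of bytes from any `read(n)` callable while never asking for more than 16384 bytes at a time.

```python
import io

CHUNK = 16384  # same cap as the patched imaplib line


def read_exact(read, size, chunk=CHUNK):
    """Read up to `size` bytes by calling read(n) with n <= chunk each time."""
    parts = []
    got = 0
    while got < size:
        data = read(min(size - got, chunk))
        if not data:  # source ran dry before `size` bytes arrived
            break
        parts.append(data)
        got += len(data)
    return b"".join(parts)


if __name__ == "__main__":
    source = io.BytesIO(b"x" * 50000)
    print(len(read_exact(source.read, 50000)))
```

A large message is still read in full; it just arrives as a series of small requests instead of one huge one.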

  4. Configure getmail – Now that most of the fun is taken care of, we need to set up a configuration file for getmail (~/.getmail/getmailrc) and create the proper local destination. First the getmailrc file:
    [retriever]
    type = SimpleIMAPSSLRetriever
    server = imap.gmail.com
    mailboxes = ("[Gmail]/Starred",)
    username = username@yourdomain.com
    password = xxx
    
    [destination]
    type = Maildir
    path = ~/Maildir/
    
    [options]
    verbose = 2
    message_log = ~/.getmail/gmail.log

    First of all, we are using IMAP to retrieve mail as POP has a limit of 99 documents per access and that would take forever.

    Second, we are using the Maildir format for the destination so we need to make sure the target directories have been created (~/Maildir/cur, ~/Maildir/new, ~/Maildir/tmp).

    Third, we need to specify a mailbox or mailboxes to download or the INBOX will be the default.

    Fourth, we need a trailing comma after the last mailbox. This is less a getmail parsing error than Python tuple syntax: the mailboxes option must be a tuple, and with a single entry the parentheses alone just wrap a string. The trailing comma is what turns it into a one-element tuple.

    Fifth, we need to know the syntax of Gmail’s internal IMAP structure to pull down discrete folders. Non-label folders (Starred, Sent Mail, Drafts, etc.) are accessed with “[Gmail]/Starred” (as in the above config) and labels are accessed directly. For example, the label “Important Project” would have this in the config:

    mailboxes = ("Important Project",)
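The trailing-comma point and the folder naming above can be checked with a few lines of Python. This is a sketch under the assumption, taken from the text, that getmail reads the option as a Python tuple; `mailboxes_line` is a hypothetical helper, and it does not handle names that contain double quotes.

```python
import ast


def mailboxes_line(names):
    """Render a getmailrc `mailboxes` option from a list of folder names."""
    quoted = ", ".join('"%s"' % n for n in names)
    # Always end with a comma so a single name still forms a tuple.
    return "mailboxes = (%s,)" % quoted


# Without the trailing comma the parentheses only group, giving a string.
assert isinstance(ast.literal_eval('("[Gmail]/Starred")'), str)
# With it, Python sees a one-element tuple.
assert ast.literal_eval('("[Gmail]/Starred",)') == ("[Gmail]/Starred",)

if __name__ == "__main__":
    print(mailboxes_line(["[Gmail]/Starred"]))
    print(mailboxes_line(["Important Project", "[Gmail]/Sent Mail"]))
```

A trailing comma after several names is harmless, so the helper emits it in every case.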
  5. Download your Gmail – For every folder/label I had within Gmail, I downloaded to a separate folder so I could import into dovecot IMAP without hassle. This entailed changing the mailboxes option in getmailrc, running getmail, renaming Maildir to label/directory name, rinsing, repeating.
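The edit, run, rename cycle in step 5 could be scripted. The sketch below is one possible approach, not what was actually used for the post: it writes a separate rc file and Maildir per mailbox instead of renaming afterwards. The helper names and the directory naming scheme are hypothetical, and the `--getmaildir` and `--rcfile` options are used as I understand getmail's command line, so check them against your installed version.

```python
import os
import subprocess

RC_TEMPLATE = """[retriever]
type = SimpleIMAPSSLRetriever
server = imap.gmail.com
mailboxes = ("%(mailbox)s",)
username = %(username)s
password = %(password)s

[destination]
type = Maildir
path = %(path)s

[options]
verbose = 2
message_log = %(log)s
"""


def prepare_label(base, mailbox, username, password):
    """Create a Maildir and an rc file for one mailbox; return the rc path."""
    # Hypothetical naming scheme: "[Gmail]/Starred" -> "Gmail_Starred".
    safe = mailbox.replace("[", "").replace("]", "").replace("/", "_")
    maildir = os.path.join(base, safe)
    for sub in ("cur", "new", "tmp"):
        os.makedirs(os.path.join(maildir, sub), exist_ok=True)
    rc_path = os.path.join(base, "getmailrc-" + safe)
    with open(rc_path, "w") as fh:
        fh.write(RC_TEMPLATE % {
            "mailbox": mailbox,
            "username": username,
            "password": password,
            "path": maildir + "/",  # keep the trailing slash, as in the config above
            "log": os.path.join(base, "gmail.log"),
        })
    os.chmod(rc_path, 0o600)  # the file holds a password
    return rc_path


def fetch_label(base, rc_path):
    """Run getmail once against the given rc file (assumed CLI options)."""
    return subprocess.call(
        ["getmail", "--getmaildir", base, "--rcfile", rc_path])
```

With these in place, the loop is roughly `for label in labels: fetch_label(base, prepare_label(base, label, user, password))`, and each label ends up in its own directory, ready to import.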
  6. Retain Times – Because maildir uses the modification time of every file to determine the sent date, all emails pulled by the above method will show the time they were downloaded rather than the time they were sent. The PHP script below restores the modification times from each message’s Date header:
    <?php
    /* VARS ***********************************************************/
    // SITE_DIR is not defined in the original listing; assumed here to be the
    // directory that contains Maildir/ (with a trailing slash).
    if (!defined('SITE_DIR')) { define('SITE_DIR', getenv('HOME').'/'); }
    $box = '';  // name of the downloaded label directory
    $stem = SITE_DIR.'Maildir/'.$box.'/new/';
    /******************************************************************/

    $skip = array('.', '..', '.DS_Store');
    foreach (scandir($stem) as $item) {
      if (in_array($item, $skip)) { continue; }
      $file = $stem.$item;
      $content = file_get_contents($file);
      // Prepend a newline so a Date: header on the very first line still matches.
      $date = extractText("\n".$content, "\nDate: ", "\n");
      $utime = ($date === false) ? false : strtotime($date);
      if ($utime === false) { continue; }  // no usable Date header; leave the file alone
      touch($file, $utime);  // set the file times to the message's sent date
    }

    // Return the text between the first $start and the next $end, or false.
    function extractText($content, $start, $end) {
      $pos = stripos($content, $start);
      if ($pos === false) { return false; }
      $startpoint = $pos + strlen($start);
      $endpoint = strpos($content, $end, $startpoint);
      if ($endpoint === false) { $endpoint = strlen($content); }
      return trim(substr($content, $startpoint, $endpoint - $startpoint));
    }
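Since getmail already requires Python, the same fix can be done without PHP. This is a sketch of an alternative, not a port of the script above: it swaps the string search for the standard library's email header parser, and `restore_mtimes` is a hypothetical helper name.

```python
import os
from email.parser import BytesParser
from email.utils import parsedate_to_datetime


def restore_mtimes(new_dir):
    """Set each message file's mtime from its Date header; return the count."""
    changed = 0
    for name in os.listdir(new_dir):
        path = os.path.join(new_dir, name)
        if name.startswith(".") or not os.path.isfile(path):
            continue  # skip .DS_Store and anything that is not a message file
        with open(path, "rb") as fh:
            msg = BytesParser().parse(fh, headersonly=True)
        date_header = msg.get("Date")
        if not date_header:
            continue
        try:
            when = parsedate_to_datetime(str(date_header))
        except (TypeError, ValueError):
            continue  # unparsable date; leave the file alone
        # Keep the access time, change only the modification time.
        os.utime(path, (os.stat(path).st_atime, when.timestamp()))
        changed += 1
    return changed


if __name__ == "__main__":
    # Assumed location, matching the getmailrc destination above.
    target = os.path.expanduser("~/Maildir/new")
    print(restore_mtimes(target), "files updated")
```

Run it once per downloaded label directory, pointing it at that directory's new/ folder.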

photo: chris ivarson

This is a reprint of a post I originally made at http://www.propertymaps.com/blog. I felt it was relevant to the current Gmail posts, so I am reprinting it with slight modifications.
