EmailDiscussions.com  

16 Jun 2025, 04:28 AM   #1
rscaramelo
Senior Member
 
Join Date: Jan 2017
Posts: 129
duplicate emails

In migrating from multiple services to the one I settled on, I have accumulated a massive number of duplicate emails that I want to delete. I have always used Thunderbird with a duplicate-removal extension, but it now constantly locks up. I think that's because of the sheer size of the mailbox: I have about 300,000 emails, and if I had to guess I would be down to under 50K once the dupes were gone. Can you suggest what I could use to clean up this mess?

16 Jun 2025, 07:07 AM   #2
JeremyNicoll
Cornerstone of the Community
 
Join Date: Dec 2017
Location: Scotland
Posts: 605
Why do you have duplicates in the first place?

Are they eg "log copies" of mails you sent to mail-list servers for which you later received a public copy?

Or copies of other people's mails which have been sent to more than one of your email addresses?

Do all these dupes reside on your PC (& hopefully backups), or are they all online somewhere?

Are you competent in any programming languages?


If they are online (accessed via TB but not [also] held in separate files on your PC, either one file per email or one per folder), are you able to export selected subsets of the mails to your PC?

I know that if I were tackling this, even if I found some freeware (or paid) tool that claimed to be able to de-dup collections of mails, I would want to be certain it did what I wanted. For example, your idea of what constitutes a duplicate might not match that of some tool. I'd want to test the process on small collections of mails...

For example, if one type of duplicate is mails sent by someone else to two of your email addresses, then while the content of each copy may be the same, many of the headers will not be. Even if both copies are saved to separate files on your PC, a file-comparison utility will not find them to be the same.
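To illustrate the point about headers, here is a minimal Python sketch using only the standard library. The two sample mails, the addresses and the choice of From/Date/Subject/body as the "same mail" test are all assumptions made for the example, not the behaviour of any particular de-dup tool.

```python
import hashlib
from email import message_from_string


def raw_digest(raw: str) -> str:
    """Hash the whole saved file, headers included (what a file compare sees)."""
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


def content_key(raw: str) -> str:
    """Hash only fields that should match between copies of one mail.

    Assumption: From, Date, Subject and the body identify a message.
    Handles simple single-part text mails only; multipart mails need more care.
    """
    msg = message_from_string(raw)
    fields = [
        msg.get("From", ""),
        msg.get("Date", ""),
        msg.get("Subject", ""),
        str(msg.get_payload()).strip(),
    ]
    return hashlib.sha256("\n".join(fields).encode("utf-8")).hexdigest()


# Two hypothetical copies of one mail, delivered to two different addresses.
COMMON = (
    "From: friend@example.org\n"
    "To: me@first.example, me@second.example\n"
    "Date: Mon, 2 Jun 2025 10:00:00 +0000\n"
    "Subject: Lunch?\n"
    "\n"
    "Are you free on Friday?\n"
)
COPY_A = "Delivered-To: me@first.example\n" + COMMON
COPY_B = "Delivered-To: me@second.example\n" + COMMON

print(raw_digest(COPY_A) == raw_digest(COPY_B))    # False
print(content_key(COPY_A) == content_key(COPY_B))  # True
```

Whether those four fields are the right definition of "duplicate" is exactly the kind of thing worth checking on a small sample first.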

One of the email clients I used a while ago had an option (for folders which would only contain 'archived' mails/news posts) of logic that threw away most of the more-techy headers, leaving not much more than To/From/Date/Subject. Even 25 years ago that vastly reduced the bulk of data to be kept ... but it also meant that copies (of the same-content mails/news posts) looked much more like each other than before.
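A rough sketch of that kind of header thinning, in Python with the standard `email` package. The list of kept headers and the sample mails are assumptions for illustration; this is not how that old client actually did it.

```python
from email import message_from_string

# Assumption: these four headers are the ones worth keeping.
KEPT_HEADERS = {"from", "to", "date", "subject"}


def strip_headers(raw: str) -> str:
    """Return the mail with every header outside KEPT_HEADERS removed."""
    msg = message_from_string(raw)
    for name in set(msg.keys()):
        if name.lower() not in KEPT_HEADERS:
            del msg[name]  # removes every occurrence of that header
    return msg.as_string()


# Hypothetical copies that differ only in delivery and filter headers.
VISIBLE = (
    "From: friend@example.org\n"
    "To: me@first.example, me@second.example\n"
    "Date: Mon, 2 Jun 2025 10:00:00 +0000\n"
    "Subject: Lunch?\n"
    "\n"
    "Are you free on Friday?\n"
)
FULL_A = "Received: from mx1.first.example\nX-Spam-Score: 0.1\n" + VISIBLE
FULL_B = "Received: from mx7.second.example\nX-Spam-Score: 0.4\n" + VISIBLE

print(strip_headers(FULL_A) == strip_headers(FULL_B))  # True
```

Once thinned like this, copies that arrived by different routes become identical text, which is why they then look "much more like each other".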

If what you are looking at is a manual process, I'd start by creating month & year subsets of all the data .. possibly further divided by subject. There are probably areas of your life where you care much less about the mails than other areas ... where you might be able to throw sets of mails away completely, or where it doesn't matter if some get deleted by accident. I mean ... if you can reduce all these 300k mails to 50k ... 50k is still A LOT. Who (if not you) will ever look at them? Even if you are doing this for yourself only, how many of the 50k do you think you will ever read again?
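If the mails can be read by a script, the month & year split could be sketched like this in Python. The function names and the "undated" bucket are made up for the example, and it assumes each mail is available as raw text with a normal Date header.

```python
from collections import defaultdict
from email import message_from_string
from email.utils import parsedate_to_datetime


def month_bucket(raw: str) -> str:
    """Return 'YYYY-MM' from the Date header, or 'undated' if it is unusable."""
    date_header = message_from_string(raw)["Date"]
    if not date_header:
        return "undated"
    try:
        return parsedate_to_datetime(date_header).strftime("%Y-%m")
    except (TypeError, ValueError):
        return "undated"


def group_by_month(raw_mails):
    """Map each 'YYYY-MM' bucket to the list of raw mails that fall in it."""
    buckets = defaultdict(list)
    for raw in raw_mails:
        buckets[month_bucket(raw)].append(raw)
    return dict(buckets)
```

Each bucket is then small enough to check (or hand to a de-dup tool) on its own.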

If you have business/legal reasons to keep some subsets, it'd be better to keep ALL of them (after all, storage is cheap these days) than to waste time & effort trying to thin them down.


One of my long-term (likely never to be done) programming ideas is to condense long threads - mostly on techy discussions - where far too many people don't properly trim what they quote of prior parts of the discussion. It probably means identifying each participant (sometimes people post under more than one id - eg work & personal addresses) & isolating each paragraph (or in some cases sentence or phrase) used in the whole discussion, storing each of those just once, then, for each mail, storing pointers to the people, paragraphs etc along with who said/quoted stuff when. I think it'd be possible to end up with much more concise discussions, as well as to follow (or ignore) individual contributors' contributions. If /I/ ever work on condensing my mail/news archives this is the approach I plan to use; duplicate mails (which for me are probably mostly "log copies" vs "public copies") would evaporate because one or other would just redefine already-known paragraphs.
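A toy Python sketch of the "store each paragraph once" idea, to show why duplicate mails would evaporate. The class and method names are hypothetical, paragraphs are assumed to be separated by blank lines, and the tracking of who said or quoted what is left out entirely.

```python
import hashlib


def normalise(paragraph: str) -> str:
    """Drop leading '>' quote markers and collapse runs of whitespace."""
    lines = (line.lstrip("> ") for line in paragraph.splitlines())
    return " ".join(" ".join(lines).split())


class ParagraphStore:
    """Keeps each distinct paragraph once; mails are lists of pointers."""

    def __init__(self):
        self.paragraphs = {}  # digest -> paragraph text
        self.messages = {}    # message id -> list of digests, in order

    def add(self, msg_id, body):
        refs = []
        for chunk in body.split("\n\n"):
            text = normalise(chunk)
            if not text:
                continue
            digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
            self.paragraphs.setdefault(digest, text)  # stored only if new
            refs.append(digest)
        self.messages[msg_id] = refs

    def rebuild(self, msg_id):
        """Reassemble a mail's text (quote markers are not restored)."""
        return "\n\n".join(self.paragraphs[d] for d in self.messages[msg_id])
```

Adding a second copy of a mail stores no new paragraphs, only another short list of pointers, and a reply that quotes an earlier paragraph points at the text already held.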
16 Jun 2025, 07:56 AM   #3
rscaramelo
Senior Member
 
Join Date: Jan 2017
Posts: 129
I believe the dupes come from my migrating from service to service to service. There are far too many for me to remove manually; I would need an entire day non-stop. There's an extension I have used in Thunderbird that removes them, but it is locking up. I have to think that's because of the sheer number of emails. I would like to find an app or another client that can keep the originals and delete all the dupes. This is years of personal email, and it is all online.

17 Jun 2025, 02:45 AM   #4
JeremyNicoll
Cornerstone of the Community
 
Join Date: Dec 2017
Location: Scotland
Posts: 605
Quote:
Originally Posted by rscaramelo View Post
The dupes are from me migrating from service to service to service I believe.
I can see that during a migration you might export mails from the old service & later import them at the new one .. but how does that result in duplicates? Wouldn't you just have moved the online copy from one place to another?


Quote:
Originally Posted by rscaramelo View Post
It's way too many for me to manually do it. I would need an entire day non-stop.
You'd need far longer than that.



Quote:
Originally Posted by rscaramelo View Post
There's an extension I have used on Thunderbird that removes them but it is locking up. I have to think it's because of the sheer number of emails?
Probably. If the extension is not capable of working with a subset of your emails (everything in just one folder, or everything for a specific month), then the only other way to make it work would be to temporarily export large amounts of mail from the online place & keep them elsewhere (preferably in more than one elsewhere), then let the extension thin down the stuff still online. Then juggle what is online & what is not.
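If an exported subset ends up as a local mbox file, a script could thin it down without touching the original. This is only a sketch using Python's standard `mailbox` module: the function name is made up, and it assumes that copies created by migration kept the same Message-ID header, which would need checking on a small sample first.

```python
import mailbox


def dedupe_mbox(src_path, dst_path):
    """Copy the first mail seen for each key from src to dst.

    Returns (kept, dropped). The source file is only read, never modified.
    """
    src = mailbox.mbox(src_path, create=False)
    dst = mailbox.mbox(dst_path)  # created if it does not exist yet
    seen = set()
    kept = dropped = 0
    try:
        for msg in src:
            msg_id = (msg.get("Message-ID") or "").strip()
            # Assumed fallback when Message-ID is missing: a few visible headers.
            key = msg_id or (msg.get("From"), msg.get("Date"), msg.get("Subject"))
            if key in seen:
                dropped += 1
                continue
            seen.add(key)
            dst.add(msg)
            kept += 1
        dst.flush()
    finally:
        src.close()
        dst.close()
    return kept, dropped
```

Because the result is written to a new file, the export stays available as a backup until the thinned copy has been checked.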


Quote:
Originally Posted by rscaramelo View Post
I would like to find an app or another client that can keep the originals and delete all the dupes. This is years of personal email. This is all online.

Even if the end position is to have your simplified mail collection still online, I think you may have to run the simplifying process offline.


I think it's possible to run more than one instance of TB, especially if those not in use for the online mails are each (if several) an instance of "Portable Thunderbird" - designed to be run entirely on eg a USB stick, but it could be made to use a PC's hard disk or SSD. See: https://portableapps.com/apps/intern...rbird_portable

Over the years I've used quite a lot of these Portable apps; one of whose advantages is that they don't need a full/proper (registry-affecting) install.

Using any client (with mails on local storage, ie not online) is only a good idea if you do regularly back-up your files. Presumably at present you rely on wherever your online mails are, to keep them backed-up? It's been years since I used a client with files on my PC(s), but I synced the DAILY backups of all the mail files across three machines so I could run the client from a restored backup on either of the secondary machines.

Using one or more instances of Portable TB would have the advantage that (if you're a long-term user of TB) you wouldn't have to learn a whole new client. But it has an obvious disadvantage that you'd need not to get confused about what each one was for. Ways of doing that might be to set up different colour schemes in each, or name at least the parent folders (if not all of them) differently (maybe just eg a 'system' prefix).
All times are GMT +9.

 

Copyright EmailDiscussions.com 1998-2022. All Rights Reserved. Privacy Policy