Remove duplicates from MongoDB

As of version 3.0, MongoDB dropped support for the dropDups index option because of its dangerous nature: there was no way to know which of the duplicate documents would be removed. (We don't want to break the dependency chain, do we?)

Given a simple document representing a sports team:

{
  "name": "Knights",
  "city": "Los Angeles",
  "state": "CA"
}

We want to remove all duplicates that share the same combination of name, city and state. To view all duplicates of this combination, run an aggregation in MongoDB:

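db.getCollection('teams').aggregate([
    // Ignore documents where any of the three fields is an empty string
    { $match: {
        name: { $ne: '' },
        city: { $ne: '' },
        state: { $ne: '' }
    }},
    // Group by the combination and collect the _id of every member
    { $group: {
        _id: { name: "$name", city: "$city", state: "$state" },
        count: { $sum: 1 },
        dups: { $push: "$_id" }
    }},
    // Keep only combinations that occur more than once
    { $match: {
        count: { $gt: 1 }
    }}
])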

Each result describes one duplicated combination: count is the number of copies and dups holds their _id values. An empty result means there are no duplicates.
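To make the shape of that output concrete, here is a minimal sketch in plain JavaScript that mirrors the three pipeline stages on an in-memory array. It is not MongoDB code: the function name findDuplicateGroups, the sample documents and their numeric _id values are made up for illustration.

```javascript
// Plain-JavaScript sketch of the pipeline stages; runs outside MongoDB.
function findDuplicateGroups(docs) {
  const groups = new Map();
  for (const doc of docs) {
    // First $match: skip documents with an empty-string name, city or state
    if (doc.name === '' || doc.city === '' || doc.state === '') continue;
    // $group: one bucket per (name, city, state) combination
    const key = JSON.stringify([doc.name, doc.city, doc.state]);
    if (!groups.has(key)) {
      groups.set(key, {
        _id: { name: doc.name, city: doc.city, state: doc.state },
        count: 0,
        dups: []
      });
    }
    const group = groups.get(key);
    group.count += 1;
    group.dups.push(doc._id);
  }
  // Second $match: keep only combinations seen more than once
  return Array.from(groups.values()).filter(g => g.count > 1);
}

// Hypothetical sample data
const sampleTeams = [
  { _id: 1, name: 'Knights', city: 'Los Angeles', state: 'CA' },
  { _id: 2, name: 'Knights', city: 'Los Angeles', state: 'CA' },
  { _id: 3, name: 'Knights', city: 'San Diego', state: 'CA' }
];

console.log(JSON.stringify(findDuplicateGroups(sampleTeams)));
```

With this sample data the function returns a single group for the Los Angeles Knights, with count 2 and dups [1, 2]. The San Diego team appears only once, so it is filtered out.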

To remove the duplicates, keeping one document from each group, run:

var duplicates = [];

db.getCollection('teams').aggregate([
  { $match: {
      name: { $ne: '' },
      city: { $ne: '' },
      state: { $ne: '' }
  }},
  { $group: {
      _id: { name: "$name", city: "$city", state: "$state" },
      count: { $sum: 1 },
      dups: { $push: "$_id" }
  }},
  { $match: {
      count: { $gt: 1 }
  }}
]).forEach(function(doc) {
    // Keep the first _id in each group; every remaining one is a duplicate
    doc.dups.shift();
    doc.dups.forEach(function(dupId) {
        duplicates.push(dupId);
    });
});

// Delete all collected duplicates in a single call
db.getCollection('teams').remove({ _id: { $in: duplicates } });
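The script keeps whichever _id happens to come first in dups, and nothing in the pipeline sorts the documents, so the surviving copy is effectively arbitrary. If it matters which copy survives, one option is to choose the keeper explicitly. The following is only a sketch in plain JavaScript: collectIdsToRemove and the sample ids are hypothetical, and it assumes the ids can be ordered meaningfully by their string form.

```javascript
// Sketch: keep the smallest id (by string comparison) in each group
// and return all the others as candidates for removal.
function collectIdsToRemove(groups) {
  const toRemove = [];
  for (const group of groups) {
    // Sort a copy so the aggregation result itself is not mutated
    const sorted = group.dups.slice().sort(function (a, b) {
      const x = String(a);
      const y = String(b);
      return x < y ? -1 : x > y ? 1 : 0;
    });
    // sorted[0] is the keeper; everything after it is a duplicate
    toRemove.push(...sorted.slice(1));
  }
  return toRemove;
}

// Hypothetical aggregation output using string ids
const exampleGroups = [
  {
    _id: { name: 'Knights', city: 'Los Angeles', state: 'CA' },
    count: 3,
    dups: ['c3', 'a1', 'b2']
  }
];

console.log(collectIdsToRemove(exampleGroups));
```

Here 'a1' is kept and 'b2' and 'c3' are returned for removal. The returned array could then be passed to the $in filter in place of duplicates.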

A word of caution: this script does not check for dependencies before deleting, so any document that references a removed _id will be left pointing at nothing. Use at your own risk.
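One way to reduce that risk is to hold back any duplicate that is still referenced elsewhere. The sketch below shows the idea in plain JavaScript. The players collection and its teamId field mentioned in the comment are purely hypothetical, as is the function name partitionByReferences; adapt them to whatever actually references your teams.

```javascript
// Sketch: split candidate ids into those still referenced elsewhere
// (blocked) and those that nothing points to (safe to delete).
function partitionByReferences(candidateIds, referencedIds) {
  // Compare by string form so the lookup does not depend on object identity
  const referenced = new Set(referencedIds.map(String));
  const safe = [];
  const blocked = [];
  for (const id of candidateIds) {
    if (referenced.has(String(id))) {
      blocked.push(id);
    } else {
      safe.push(id);
    }
  }
  return { safe, blocked };
}

// In the shell, referencedIds might come from a hypothetical collection:
//   db.getCollection('players').distinct('teamId')
const partitioned = partitionByReferences(['a1', 'b2', 'c3'], ['b2', 'z9']);
console.log(partitioned);
```

In this example 'b2' is still referenced, so it ends up in blocked, while 'a1' and 'c3' are safe. Only the safe ids would be passed to the delete; the blocked ones need their references repointed to the surviving document first.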
