DropBox "Conflicted Copy" Handling
Hi Sven, long-term power user here. I wasn't sure which section to put this suggestion into; it fits both Platform Identifier and SER, and the solution could be applied to either.
I sometimes run up to 5 SER instances and 3 Platform Identifier instances across multiple servers, using Dropbox to sync my files from server to server as they are processed. I know I am one of many who use this basic but highly effective setup.
No matter how fast your servers are, Dropbox's sync latency means "conflicted copy" files are created at scale on files larger than 1 MB (this likely depends on your setup), and it can end up making a huge mess; for example, my identified folder contains almost 6,000 files because of this.
Request:
Platform Identifier Feature
Automated "hands-free" Dedupe and merge of conflicted files so:
on noticing conflicted files:
Example------------------------------------------------
sitelist_Article-AltoCMS-LiveStreet (swedishpowerhouse's conflicted copy 2018-06-04 (1)).txt
sitelist_Article-AltoCMS-LiveStreet (swedishpowerhouse's conflicted copy 2018-06-04 (2)).txt
sitelist_Article-AltoCMS-LiveStreet (swedishpowerhouse's conflicted copy 2018-06-04 (3)).txt
sitelist_Article-AltoCMS-LiveStreet (swedishpowerhouse's conflicted copy 2018-06-04 (4)).txt
sitelist_Article-AltoCMS-LiveStreet (swedishpowerhouse's conflicted copy 2018-06-04).txt
sitelist_Article-AltoCMS-LiveStreet (swedishpowerhouse's conflicted copy 2018-06-08 (1)).txt
sitelist_Article-AltoCMS-LiveStreet (swedishpowerhouse's conflicted copy 2018-06-08 (2)).txt
sitelist_Article-AltoCMS-LiveStreet (swedishpowerhouse's conflicted copy 2018-06-08 (3)).txt
sitelist_Article-AltoCMS-LiveStreet (swedishpowerhouse's conflicted copy 2018-06-08 (5)).txt
sitelist_Article-AltoCMS-LiveStreet (swedishpowerhouse's conflicted copy 2018-06-08).txt
sitelist_Article-AltoCMS-LiveStreet.txt
/Example-----------------------------------------------
All documents are automatically deduped and merged into the original file:
sitelist_Article-AltoCMS-LiveStreet.txt
and the now-redundant conflicted copies are deleted.
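For illustration, here is a minimal sketch of how such a merge could work (Python, not actual PI/SER code). It assumes conflicted copies follow Dropbox's usual "(owner's conflicted copy YYYY-MM-DD (n))" naming; the folder path and the merge_conflicted name are made up for the example.
Example (hypothetical sketch)--------------------------
# Hypothetical sketch only -- not PI/SER code. Merges Dropbox
# "conflicted copy" files back into their original sitelist file,
# dedupes the combined URLs, and deletes the redundant copies.
import re
from pathlib import Path

# Matches e.g. "sitelist_X (user's conflicted copy 2018-06-04 (1))"
CONFLICT = re.compile(r"^(?P<base>.+) \(.*conflicted copy.*\)$")

def merge_conflicted(folder: Path) -> None:
    groups = {}  # original file -> list of its conflicted copies
    for f in folder.glob("*.txt"):
        m = CONFLICT.match(f.stem)
        if m:
            original = folder / (m.group("base") + ".txt")
            groups.setdefault(original, []).append(f)
    for original, copies in groups.items():
        sources = copies + ([original] if original.exists() else [])
        urls = set()
        for src in sources:
            text = src.read_text(encoding="utf-8", errors="ignore")
            urls.update(line.strip() for line in text.splitlines() if line.strip())
        # Rewrite the original with the deduped union of all copies...
        original.write_text("\n".join(sorted(urls)) + "\n", encoding="utf-8")
        # ...then delete the now-redundant conflicted copies.
        for copy in copies:
            copy.unlink()

merge_conflicted(Path(r"C:\GSA\site_lists\identified"))  # example path
/Example-----------------------------------------------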
Request:
SER Feature
It makes no difference to me which program handles this process, but I wanted to point out how Botmaster's Hrefer handles dedupe: in Hrefer you have the option to dedupe the parsed files on launch.
Perhaps if SER did the simple dedupe/merge/delete described in the Platform Identifier request on boot, this would sort the issue and help keep everyone's GSA directories streamlined.
Right now I have quick-and-dirty workarounds via PI, but I still have to pause all my servers once a week to sort this out, or it overwhelms Dropbox.
Keep up the great work. It would be great if you could implement the above, but it's in no way a deal breaker.
Comments
I used to have a problem like that when I was checking and sorting the list from my home PC on a 200/100 Mbps line.
But since I moved GSA Platform Identifier to a dedicated VPS, I don't have conflict issues anymore, and my identified unique list is 100,286,611 URLs, or just over 2 GB.
I did, however, notice that GSA Platform Identifier sometimes delays saving the URLs it has sorted. So if you stop or restart GSA Platform Identifier, make sure you wait at least 15 minutes before doing the dup removals and list cleanup, because PI will wait up to the configured delay before saving what it has identified and sorted. In my case I have set it to save every 600 seconds, which is 10 minutes (the maximum you can set).
Before you run the duplicate-URL removal in GSA SER, GSA PI, or a custom tool (as in my case), make sure no Dropbox files are being updated. When all Dropbox files are synced, exit Dropbox, then proceed with the list cleanup. When it is done, restart Dropbox and let it sync the updated list.
On the rare occasion that you do get a couple of conflicts, simply open the main folder where your URLs are saved, search within it for the term "conflict", select all the matches, and cut them. Paste them into a different folder; I just create a folder called "conflict" and put all the conflicted files in there.
Now use the GSA merge tool to merge them all into one file and, when done, remove duplicate URLs. Then run only those through Platform Identifier again (see the sketch below).
It sounds like a lot of work, but it takes less than 5 minutes to do.
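For anyone who prefers to script it, here is a rough Python sketch of the same move/merge/dedupe steps. The paths and output file name are examples only, and the merge can equally be done with GSA's own merge tool.
Example (hypothetical sketch)--------------------------
# Rough sketch of the manual cleanup described above: move all
# "conflict" files into a separate folder, then merge and dedupe
# them into one file to feed back through Platform Identifier.
import shutil
from pathlib import Path

src = Path(r"C:\GSA\site_lists\identified")  # example path
dst = src / "conflict"
dst.mkdir(exist_ok=True)

urls = set()
for f in src.glob("*conflict*"):
    if f.is_file():  # skip the "conflict" folder itself
        text = f.read_text(encoding="utf-8", errors="ignore")
        urls.update(line.strip() for line in text.splitlines() if line.strip())
        shutil.move(str(f), str(dst / f.name))

# One deduped file to re-run through Platform Identifier
out = dst / "reidentify_me.txt"  # example name
out.write_text("\n".join(sorted(urls)) + "\n", encoding="utf-8")
/Example-----------------------------------------------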
Thanks @Sven