Clarification on [RD] Projects default behavior - is this correct?
I'm still getting my head around how this amazing tool works, and I ended up with a whole bunch of duplicates in my PI identified files due to a misunderstanding on my part. All good and so on ...
But as a result of this I decided to "start over", so to speak: I chucked all my scraped URL txt files into a folder and added them all to one new project to consolidate all my links. I have the "add a remove duplicates project" setting turned on, so after making this new project I have two projects, Consolidate Links and [RD] Consolidate Links. I selected both and started them.
I noticed that [RD] Consolidate Links has a status of "working" and an ETA, but it shows 0 threads in use and all URL counts are still 0, while the other project is categorising as normal.
Am I correct in assuming that the Consolidate Links project is processing URLs that have yet to be deduped (and hence is "wasting" CPU time, in the sense that it could have avoided identifying them) -
OR
is the [RD] Consolidate Links project deduping links before processing in this situation?
Comments
Tools -> Remove Duplicate URLs first,
then run one project with all those URLs, without an [RD] project, until the end (given I'll have no other projects running or links being monitored),
THEN
make a monitor project with an [RD] project to add in any future URLs (a rough standalone version of the one-off dedupe step is sketched below).
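If you ever want to do that one-off consolidate-and-dedupe step outside the tool, here is a minimal Python sketch of the idea. The folder and file names are made up, and it just does an exact string match on each line, which may not match whatever normalisation PI applies:

```python
# Minimal sketch: consolidate a folder of scraped-URL .txt files into one
# deduplicated list before feeding it to a single project.
# SOURCE_DIR and OUTPUT_FILE are hypothetical names, not anything PI uses.
from pathlib import Path

SOURCE_DIR = Path("scraped_urls")          # folder holding the scraped .txt files
OUTPUT_FILE = Path("consolidated_unique.txt")

def consolidate(source_dir: Path, output_file: Path) -> int:
    """Write every unique, non-empty line found in source_dir/*.txt to output_file."""
    seen = set()
    with output_file.open("w", encoding="utf-8") as out:
        for txt_file in sorted(source_dir.glob("*.txt")):
            with txt_file.open("r", encoding="utf-8", errors="ignore") as fh:
                for line in fh:
                    url = line.strip()
                    if url and url not in seen:
                        seen.add(url)
                        out.write(url + "\n")
    return len(seen)

if __name__ == "__main__":
    unique_count = consolidate(SOURCE_DIR, OUTPUT_FILE)
    print(f"Wrote {unique_count} unique URLs to {OUTPUT_FILE}")
```

Note this keeps every unique URL in memory, so it is only practical while the lists still fit in RAM.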
I like it, thanks!
Edited: GSA PI just wiped the floor with GScraper, which was my previous "go to" application for deduping without pulling out textpipe, cream or emeditor.
I had ~158M URLs (106M unique) across 6.1GB of files; the largest file is 1.48M, and 6GB of that is covered by the first 57 files, the smallest of which is 38MB.
GScraper took about 2 hours, and was slowing down, to get to ~50%. The issue was that it was using 32GB of RAM on my 32GB server, so I eventually terminated the process.
Opened the same files in GSA PI -> Tools -> Remove Duplicate URLs and it's been running for about ... 15 minutes? 50M URLs in the deduped folder already, under 50MB of RAM in use, and no CPU load on the Task Monitor systray icon. Happy customer right here.
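For anyone wondering how a dedupe over ~158M URLs can stay under 50MB of RAM: I have no idea what GSA PI actually does internally, but one standard low-memory technique is an external sort-merge, roughly like the sketch below. The chunk size and file names are assumptions, and this is just an illustration of the general approach, not PI's implementation:

```python
# Sketch of an external sort-merge dedupe: sort fixed-size chunks on disk,
# then stream-merge the sorted chunks and drop adjacent duplicates.
# CHUNK_LINES and the input/output paths are made-up values.
import heapq
import os
import tempfile
from pathlib import Path

CHUNK_LINES = 1_000_000  # lines held in memory at any one time

def external_dedupe(input_files, output_file):
    temp_paths = []
    chunk = []

    def flush(lines):
        # Sort one chunk in memory and spill it to a temporary file.
        lines.sort()
        fd, path = tempfile.mkstemp(suffix=".chunk", text=True)
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            tmp.write("\n".join(lines) + "\n")
        temp_paths.append(path)

    # Pass 1: split all input files into sorted chunks on disk.
    for input_file in input_files:
        with open(input_file, "r", encoding="utf-8", errors="ignore") as fh:
            for line in fh:
                url = line.strip()
                if url:
                    chunk.append(url)
                if len(chunk) >= CHUNK_LINES:
                    flush(chunk)
                    chunk = []
    if chunk:
        flush(chunk)

    # Pass 2: k-way merge of the sorted chunks, skipping repeated URLs.
    handles = [open(p, "r", encoding="utf-8") for p in temp_paths]
    try:
        with open(output_file, "w", encoding="utf-8") as out:
            previous = None
            for line in heapq.merge(*handles):
                url = line.rstrip("\n")
                if url != previous:
                    out.write(url + "\n")
                    previous = url
    finally:
        for h in handles:
            h.close()
        for p in temp_paths:
            os.remove(p)

external_dedupe(sorted(Path("scraped_urls").glob("*.txt")), "deduped.txt")
```

Because only one chunk lives in memory at a time and the merge is streamed, memory use stays roughly constant no matter how many URLs you throw at it, which is consistent with the kind of numbers reported above.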