Clarification on [RD] Projects default behavior - is this correct?
I'm still getting my head around how this amazing tool works, and I ended up with a whole bunch of duplicates in my PI identified files due to a misunderstanding on my part. All good and so on ...
But as a result of this I decided to "start over", so to speak: I chucked all my scraped URL txt files into a folder and added them all to one new project to consolidate all my links. I have the "add a remove duplicates project" setting turned on, so after making this new project I have two projects, Consolidate Links and [RD] Consolidate Links. I selected both and started them.
I noticed that [RD] Consolidate Links has a status of "working" and an ETA, but it shows 0 threads in use and all URL counts are still 0, while the other project is categorising as normal.
Am I correct in assuming that the Consolidate Links project is processing URLs that have yet to be deduped (and hence is "wasting" CPU time, in the sense that it could have avoided identifying them) -
OR
is the [RD] Consolidate Links project deduping links before processing in this situation?
Comments
Tools -> Remove Duplicate URLs first,
then run one project with all those URLs, without an [RD] project, until the end (given I'll have no other projects running or links being monitored),
THEN
make a monitor project with an [RD] project to add in any future URLs (a rough standalone version of the one-off dedupe step is sketched below).
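If you ever want to do that one-off consolidate-and-dedupe step outside the tool, here is a minimal Python sketch of the idea. The folder and file names are made up, and it just does an exact string match on each line, which may not match whatever normalisation PI applies:

```python
# Minimal sketch: consolidate a folder of scraped-URL .txt files into one
# deduplicated list before feeding it to a single project.
# SOURCE_DIR and OUTPUT_FILE are hypothetical names, not anything PI uses.
from pathlib import Path

SOURCE_DIR = Path("scraped_urls")          # folder holding the scraped .txt files
OUTPUT_FILE = Path("consolidated_unique.txt")

def consolidate(source_dir: Path, output_file: Path) -> int:
    """Write every unique, non-empty line found in source_dir/*.txt to output_file."""
    seen = set()
    with output_file.open("w", encoding="utf-8") as out:
        for txt_file in sorted(source_dir.glob("*.txt")):
            with txt_file.open("r", encoding="utf-8", errors="ignore") as fh:
                for line in fh:
                    url = line.strip()
                    if url and url not in seen:
                        seen.add(url)
                        out.write(url + "\n")
    return len(seen)

if __name__ == "__main__":
    unique_count = consolidate(SOURCE_DIR, OUTPUT_FILE)
    print(f"Wrote {unique_count} unique URLs to {OUTPUT_FILE}")
```

Note this keeps every unique URL in memory, so it is only practical while the lists still fit in RAM.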
I like it, thanks!
Edited: GSA PI just wiped the floor with GScraper, which was my previous "go to" application for deduping without pulling out textpipe, cream or emeditor.
I had ~158M URLs (106M unique) across 6.1GB of files; the largest file is 1.48M, and 6GB of that is covered by the first 57 files, the smallest of which is 38MB.
GScraper took about 2 hours, and was slowing down, to get to ~50%. The issue was that it was using 32GB of RAM on my 32GB server, so I eventually terminated the process.
Opened the same files in GSA PI -> Tools -> Remove Duplicate URLs and it's been running for about ... 15 minutes? 50M URLs in the deduped folder already, under 50MB of RAM in use, and no CPU load on the Task Monitor systray icon. Happy customer right here.
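For anyone wondering how a dedupe over ~158M URLs can stay under 50MB of RAM: I have no idea what GSA PI actually does internally, but one standard low-memory technique is an external sort-merge, roughly like the sketch below. The chunk size and file names are assumptions, and this is just an illustration of the general approach, not PI's implementation:

```python
# Sketch of an external sort-merge dedupe: sort fixed-size chunks on disk,
# then stream-merge the sorted chunks and drop adjacent duplicates.
# CHUNK_LINES and the input/output paths are made-up values.
import heapq
import os
import tempfile
from pathlib import Path

CHUNK_LINES = 1_000_000  # lines held in memory at any one time

def external_dedupe(input_files, output_file):
    temp_paths = []
    chunk = []

    def flush(lines):
        # Sort one chunk in memory and spill it to a temporary file.
        lines.sort()
        fd, path = tempfile.mkstemp(suffix=".chunk", text=True)
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            tmp.write("\n".join(lines) + "\n")
        temp_paths.append(path)

    # Pass 1: split all input files into sorted chunks on disk.
    for input_file in input_files:
        with open(input_file, "r", encoding="utf-8", errors="ignore") as fh:
            for line in fh:
                url = line.strip()
                if url:
                    chunk.append(url)
                if len(chunk) >= CHUNK_LINES:
                    flush(chunk)
                    chunk = []
    if chunk:
        flush(chunk)

    # Pass 2: k-way merge of the sorted chunks, skipping repeated URLs.
    handles = [open(p, "r", encoding="utf-8") for p in temp_paths]
    try:
        with open(output_file, "w", encoding="utf-8") as out:
            previous = None
            for line in heapq.merge(*handles):
                url = line.rstrip("\n")
                if url != previous:
                    out.write(url + "\n")
                    previous = url
    finally:
        for h in handles:
            h.close()
        for p in temp_paths:
            os.remove(p)

external_dedupe(sorted(Path("scraped_urls").glob("*.txt")), "deduped.txt")
```

Because only one chunk lives in memory at a time and the merge is streamed, memory use stays roughly constant no matter how many URLs you throw at it, which is consistent with the kind of numbers reported above.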