Identified = Dirty links
Submitted = Used inside GSA, dummy projects only
Unrecognized = Only one, using the wildcard matching and extended matching options in GSA Pi
Verified = Global site list, clean list
Harvesters = Dirty list
How I use GSA Pi

Step 1 - Monitor the "Identified" and "Harvesters" folders and send everything to the "Submitted" folder, which the GSA dummy projects use for testing.
Step 2 - Remove duplicates from the "Submitted" and "Verified" folders (see the sketches below).
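Step 1 is essentially a merge-and-move job. If you wanted to reproduce it outside of Pi, a minimal Python sketch might look like the one below. This is not Pi's actual mechanism; the folder paths are placeholders, and one .txt file per engine with one URL per line is an assumption about the site list layout.

# Rough outside-of-Pi sketch of Step 1, not Pi's actual mechanism.
# Assumptions: placeholder paths, one .txt per engine, one URL per line.
from pathlib import Path

SOURCES = [Path("C:/SiteLists/Identified"), Path("C:/SiteLists/Harvesters")]
TARGET = Path("C:/SiteLists/Submitted")

def route_new_links() -> None:
    TARGET.mkdir(parents=True, exist_ok=True)
    for source in SOURCES:
        for src_file in source.glob("*.txt"):
            dest = TARGET / src_file.name
            with src_file.open("r", encoding="utf-8", errors="ignore") as fin, \
                 dest.open("a", encoding="utf-8") as fout:
                for line in fin:
                    # Normalize line endings so appended files never run together.
                    fout.write(line.rstrip("\r\n") + "\n")
            src_file.unlink()  # clear the source so links aren't sent twice

if __name__ == "__main__":
    route_new_links()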
I just want to cut down processing time by deduplicating, since GSA Pi is fast at this even on millions of links. I know I will end up with duplicates somewhere, so if anyone has ideas on the logic, please chime in. Thanks.
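For the dedup itself, an in-memory set is why this stays fast even on millions of links. A rough sketch of Step 2 under the same assumed file layout (this is my own illustration, not how Pi does it internally):

# Set-based dedup sketch; same assumed layout (one URL per line, .txt per engine).
from pathlib import Path

def dedupe_file(path: Path) -> None:
    seen = set()
    unique = []
    with path.open("r", encoding="utf-8", errors="ignore") as f:
        for line in f:
            url = line.strip()
            # Keep only the first occurrence of each URL.
            if url and url not in seen:
                seen.add(url)
                unique.append(url)
    path.write_text("\n".join(unique) + "\n", encoding="utf-8")

for folder in ("C:/SiteLists/Submitted", "C:/SiteLists/Verified"):
    for txt in Path(folder).glob("*.txt"):
        dedupe_file(txt)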
Comments

This is why I thought of project linking/chaining like SEnukeX: remove duplicates first, then process or monitor.

I get an error:

"Cannot remove duplicates from the same directory that is being monitored."

@sengines - You shouldn't be getting that error unless you're trying to dedup a folder that is currently being monitored by a project in Pi. The error is there to let you know you can't remove duplicates from the same folder you're monitoring, since that causes conflicts in the software. You can set Pi to remove duplicates from the output, but if it's the same folder that project (or another one) is set to monitor, it won't work.

You should be able to remove duplicates from your submitted/verified SER folders as long as Pi isn't monitoring them.