filter by Domain

useruser1 · April 2015

I have an idea: when I give PI a fresh list of URLs and PI finds the engine (or engines) for a domain, it doesn't need to check the same domain again in that list (let it skip the remaining URLs from those domains).

s4nt0s · April 2015

Do you mean deduping while it's sorting? If so, we can't do it like this; it would increase CPU/mem usage quite a bit.

We added dedup domains to the latest version today, but that has to be its own dedup project.

useruser1 · April 2015

No, I mean removing them during the check.

s4nt0s · April 2015

Deduping during the check, as in deduping while it's identifying links? I'm not sure what you mean exactly.

useruser1 · April 2015

Example list of URLs:
somedomain14.com/somepath/somefile.html
somedomain32.com/somepath/somefile.html
somedomain14.com/somepath/somefile2.html ...

Once PI has identified somedomain14.com, there is no need to check somedomain14.com/somepath/somefile2.html.

s4nt0s · April 2015

Ahh, I see what you're saying. Well, the problem with that is we would need to do an extra check on every URL that it sorts. When a URL comes in, it has to check the identified URLs to see if that URL has been sorted before, which also means that as the identified list grows, it has to compare each URL against big .txt files.

It would slow things down, use more resources, etc.

The best thing for you to do would be to set up an automatic dedup project and have it automatically remove dup domains.
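
For anyone who wants to pre-filter a list before feeding it in, here is a minimal sketch of domain-level deduping in Python. It is not how PI or its dedup project works internally; the function names `domain_of` and `dedup_by_domain` are made up for illustration, and it assumes one URL per line, treats `www.example.com` and `example.com` as different domains, and keeps only the first URL seen for each domain.

```python
from urllib.parse import urlsplit


def domain_of(url: str) -> str:
    """Return the lowercased host of a URL, tolerating a missing scheme."""
    # urlsplit only fills in the host when a scheme or a leading // is present,
    # so scheme-less entries like "somedomain14.com/path" get a "//" prefix.
    if "://" not in url:
        url = "//" + url
    return urlsplit(url).hostname or ""


def dedup_by_domain(urls):
    """Keep the first URL for each domain, preserving the input order."""
    seen = set()   # domains already kept (held in memory)
    kept = []
    for url in urls:
        url = url.strip()
        if not url:
            continue  # skip blank lines
        domain = domain_of(url)
        if not domain:
            kept.append(url)  # no parsable host: leave the entry alone
            continue
        if domain in seen:
            continue  # a URL from this domain was already kept
        seen.add(domain)
        kept.append(url)
    return kept


urls = [
    "somedomain14.com/somepath/somefile.html",
    "somedomain32.com/somepath/somefile.html",
    "somedomain14.com/somepath/somefile2.html",
]
print(dedup_by_domain(urls))
```

With the example list from this thread, the third URL is dropped because somedomain14.com was already kept. The set of seen domains lives in memory, so very large lists still cost RAM, which is the trade-off described above.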
