filter by Domain

I have an idea: when I give PI a fresh list of URLs and PI identifies the engine (or engines) for a domain, it should not need to check that same domain again in the list (let it skip the remaining URLs from those domains).

Comments

  • s4nt0s Houston, Texas
    Do you mean deduping while it's sorting? If so, we can't do it like this; it would increase CPU/memory usage quite a bit.

    We added dedup domains to the latest version today, but that has to be its own dedup project.
  • No, I mean removing them during the check.
  • s4nt0s Houston, Texas
    Deduping during the check, as in deduping while it's identifying links? I'm not sure exactly what you mean.
  • example list of urls:
    somedomain14.com/somepath/somefile.html
    somedomain32.com/somepath/somefile.html
    somedomain14.com/somepath/somefile2.html ...

    Once PI has identified somedomain14.com, there is no need to check somedomain14.com/somepath/somefile2.html.
  • s4nt0s Houston, Texas
    Ahh, I see what you're saying. Well, the problem with that is we would need to do an extra check on every URL that it sorts. When a URL comes in, it has to check the identified URLs to see whether that domain has been sorted before, which also means that as the identified list grows, it has to compare each URL against big .txt files.

    It would slow things down, use more resources, etc. 

    The best thing for you to do would be to set up an automatic dedup project and have it automatically remove duplicate domains.
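
    For illustration only, the domain-level dedup discussed above could be sketched in Python roughly as follows. This is a minimal sketch run outside PI, not how PI or its dedup project actually works: the function names `domain_of` and `first_url_per_domain` are hypothetical, and keeping the seen domains in an in-memory set is an assumption.

```python
from urllib.parse import urlsplit


def domain_of(url: str) -> str:
    """Return the lowercased host part of a URL (hypothetical helper)."""
    # urlsplit only recognises the host when a scheme or a leading "//"
    # is present, so prefix bare entries like "somedomain14.com/path".
    if "://" not in url:
        url = "//" + url
    return urlsplit(url).hostname or ""


def first_url_per_domain(urls):
    """Keep only the first URL seen for each domain, preserving order."""
    seen = set()   # domains already encountered (assumed to fit in memory)
    kept = []
    for url in urls:
        domain = domain_of(url)
        if domain in seen:
            continue  # this domain is already represented; skip the URL
        seen.add(domain)
        kept.append(url)
    return kept


urls = [
    "somedomain14.com/somepath/somefile.html",
    "somedomain32.com/somepath/somefile.html",
    "somedomain14.com/somepath/somefile2.html",
]
for url in first_url_per_domain(urls):
    print(url)
```

    With the example list from this thread, only the first somedomain14.com URL and the somedomain32.com URL are kept, so the list can be reduced to one URL per domain before it is handed to PI.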