Proxies Question

Does GSA PI really need proxies?

I'm only identifying, not checking for anything else such as Moz data, number of backlinks, etc.

If so, there should be an import interval option for proxies. Let me know, thanks.


  • shaun
    I have received a few warnings over the years from my server provider when running PI without proxies, after webmasters reported my IP for hitting their sites too much.

    It's a double-edged sword: using proxies slows the tool down, but it helps keep my servers safe.
  • @shaun How do you usually set the PI number?
  • shaun
    @jonseo How do you mean, mate? The thread count?
  • @shaun Yes, actually wanted to write "thread number" :)
  • shaun
    @jonseo It changes between 500 and 1000 threads, with bandwidth limiting on or off, depending on what else that server is doing at the time.
  • @shaun Thanks! Interesting info... I use quite fast machines (hardware and connection-wise) myself. In my experience, PI consumes *a lot* more resources than SER, DP or any other software at a comparable thread count.

    I know it depends on settings, but I guess it's mostly about the difference between posting and analysing: PI works a lot faster in terms of URLs (requests) per second.

    I usually set PI to between 64 and 256 threads (on multiple machines), and I was just wondering if this kind of speed might cause said troubles...

    Generally, I like to be on the safe side, so I'd rather go with proxies in the future. I used to run PI without any in the past.
  • shaun
    @jonseo Without a doubt, it is the thread count I run at that gets me the warnings from the web server, but I like to run at a speed that can more or less keep up with ScrapeBox. Since Monday I have pushed 81 million URLs through PI; what I wanted from that goes on to my SER verification rig, and from there on to my live projects.
  • @shaun That does make a lot of sense. I didn't think of this before, but with PRE, running a dedicated high-end DP / PI VPS suddenly starts making *a lot of sense*.

    Keeping PI up with DP offers a couple of interesting new possibilities... I found that "deep matching" (or generally *all* CPU-intensive options) increases the identified ratio a lot. Do you check these options in your high-speed setup?
  • shaun
    edited June 2016
    @jonseo I tried to test deep matching when I first got PI, but it hammered my CPU lol. I have left it alone since then, but I have requested a new feature that should be here soon. Basically, when you run a batch through PI you will be given the option to add the URLs to a blacklist. This blacklist builds up over time, stores what you have already put through PI and prevents you from processing it again, so it frees up a shit ton of space down the line. I have a way to do it with ScrapeBox, but having it as an internal feature should be a lot easier.

    As this will reduce the load on the server, I plan to start testing deep matching once the blacklist feature is added to the tool, as it should then be a lot faster at processing scrapes and extracts.
  • Wow, this blacklist feature is a very nice idea!
  • shaun
    @jonseo Just thought of another feature to save time too: have a way to put a bunch of domains, like web 2.0s or news websites that SER can post to, into a domain list, and have PI remove any URL you have scraped from those domains before even processing.

    Not sure how much time that one would save compared to how much time it would take to build, but I have suggested it :).
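
The two ideas above (a blacklist of URLs already put through PI, and a list of domains to drop before processing) could be prototyped outside the tool roughly like this. It is only a minimal sketch in Python, not how PI implements or will implement either feature: the names `normalize`, `host_matches` and `filter_batch` are made up for illustration, and it assumes every URL carries a scheme such as `http://`.

```python
from __future__ import annotations

from urllib.parse import urlsplit


def normalize(url: str) -> str:
    """Lower-case scheme and host and drop the fragment, so trivially
    different spellings of one URL compare equal. Port and credentials
    are dropped too (a simplification for this sketch)."""
    parts = urlsplit(url.strip())
    host = (parts.hostname or "").lower()
    path = parts.path or "/"
    query = f"?{parts.query}" if parts.query else ""
    return f"{parts.scheme.lower()}://{host}{path}{query}"


def host_matches(host: str, skip_domains: set[str]) -> bool:
    """True if host is one of the skip domains or a subdomain of one."""
    labels = host.split(".")
    return any(".".join(labels[i:]) in skip_domains for i in range(len(labels)))


def filter_batch(urls, seen: set[str], skip_domains: set[str]) -> list[str]:
    """Return only URLs worth processing and record them in `seen`.

    `seen` plays the role of the blacklist that builds up over time;
    `skip_domains` is the hypothetical list of domains to drop up front.
    """
    fresh = []
    for raw in urls:
        url = normalize(raw)
        host = urlsplit(url).hostname or ""
        if host_matches(host, skip_domains):
            continue  # domain is on the skip list, never process it
        if url in seen:
            continue  # already processed in an earlier batch
        seen.add(url)
        fresh.append(url)
    return fresh
```

Used on two consecutive batches, the second run skips what the first one already handled:

```python
seen = set()
skip = {"wordpress.com"}
filter_batch(["http://Example.com/page#top", "http://foo.wordpress.com/post"], seen, skip)
# → ['http://example.com/page']
filter_batch(["http://example.com/page", "http://new.net/x?a=1"], seen, skip)
# → ['http://new.net/x?a=1']
```

At tens of millions of URLs an in-memory set gets heavy, so in practice the blacklist would probably need to live on disk or store hashes instead of full URLs.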
