
Advantages of Scraping Your Own URLs over SER Finding Them?

edited November 2012 in Need Help
I've always been curious whether there is any real advantage to scraping your own URLs (using set footprints) and importing them for SER to work through, versus letting SER scrape them for you during a typical project run and saving yourself the scraping time. I can't seem to pick out a real advantage here. Example:

I use SB to scrape Google and Yahoo URLs with the footprint "member.php" "Powered by MyBB" "golf bags" and come up with a list that I import into SER and run. I just don't see how my results would vary much compared to importing my keyword "golf bags" into a project, setting the appropriate settings to scrape the Google and Yahoo SEs, and ensuring that Forum > MyBB is checked.
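
For anyone following along, here is a minimal sketch (my own illustration, not anything SB or SER actually exposes) of what that footprint-plus-keyword scrape boils down to; only "golf bags" and the MyBB footprint come from my example above, the other keywords are made up:

    # Build footprint + keyword search queries of the kind SB or SER would send to
    # Google/Yahoo. Purely illustrative; neither tool is driven by code like this.
    footprints = ['"member.php" "Powered by MyBB"']        # MyBB forum footprint from the example
    keywords = ["golf bags", "golf clubs", "golf shoes"]   # niche keywords; only "golf bags" is from the post

    queries = [f'{footprint} "{keyword}"' for footprint in footprints for keyword in keywords]

    for q in queries:
        print(q)
    # "member.php" "Powered by MyBB" "golf bags"
    # "member.php" "Powered by MyBB" "golf clubs"
    # "member.php" "Powered by MyBB" "golf shoes"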

Any reasoning to use one over the other is welcome........

Comments

  • AlexR Cape Town
    I for one have been thinking that GSA will find more results, as it offers more SEs than SB. So I've been wondering what use SB has compared to GSA.

    I have SB and am just starting to use it for high-quality article sites and blog sites while GSA is running. No need to let SB sit idle.


  • I'd say the real advantage is that you don't burn out your private proxies on searching and PR checking, and instead use a massive amount of public ones just for the scraping in SB.


  • AlexR Cape Town
    @pisco - even in SB they recommend using private proxies for scraping. They say it's far more effective: fewer threads, but that's made up for by the time saved not hunting for proxies.
  • Unless you have a really massive number of them (like thousands), I think the right way to do this is to just go with public proxies (I use some shared lists; a lot better and still way cheaper than private) and let it scrape for several hours, since some of them get blacklisted. This way you are saving your precious private proxies.

    As an example, I have a small list of private proxies (only 20), and sometimes while using SB I forget to use my shared list when checking PR. After a few links, all my private proxies are blacklisted. If I go back and do it with public or shared proxies, sure, it will take longer, but I will have all of them in the end.
  • @GlobalGoogler - Yeah... I generally use SB for high-PR page/domain comments. It's really useful for that, on top of scraping for other programs like SENuke, etc. I also use it for its multitude of useful add-ons.

    @pisco - interesting theory, but I have never found any evidence that any of my private proxies are getting blacklisted or temp banned by Google. I use SB all day, every day, with 20 private proxies. The thing scrapes 100,000+ results for me in less than 10 minutes. I check PR, OBL, and then post comments all in the same sitting without any issues whatsoever. There is actually a huge difference in proxy usage between programs that utilize IE browsers and those with no browser at all (SB and SER fall into the latter category). I'd definitely use both private and public proxies for programs that actually open a browser like IE and make submissions that way.

    I have never personally experienced problems with my private proxies being banned/blacklisted within SER. I have tested with private proxies, no proxies, and public proxies. The result: private and no proxies usually produce around the same results; public proxies, however, have produced dismal results. Even the creators of SB and SER (the guy in the tutorial videos for SER) state that private proxies or no proxies will always give better performance. I have also watched (DVD and streaming video) and spoken to a TON of hardcore multitasking SEO experts who use 80+ private proxies with 5-10 VPS systems, and they never once mention their private proxies getting banned/blacklisted.

    If you have any hard evidence to back up your claims, I would love to see it, as this could help improve everyone's performance on all automation software platforms.
  • I'm pretty sure you are getting temp bans, at the very least on the PR checks, with 100k results and only 20 proxies. Do you use the Automator and just remove PR below certain limits, or do you actually see the result of the PR check on the list (after it finishes the check)? I'm not sure, but when a proxy gets banned I think SB just says the PR is "-" and shows info about a dead proxy.

    I never said public or semi-private proxies perform better than private ones, just that private proxies aren't meant to be abused when you can get them temp banned (unless you have a huge number of private proxies that you can replace from time to time). This is also why you configure a delay between each search in SER; otherwise you would just go all out.

    Here is a screenshot of a test showing it happening: the proxies were going all out, and after a while they all come back as dead and the PR is not picked up.

    [screenshot: proxies reported as dead, PR not retrieved]

    I'm not trying to contradict any expert/guru (heck, I'm still making pocket money from SEO), but the fact is this is happening to me, and that is why I use a shared list to scrape and check PR.
  • From the looks of that screenshot you are using 100 connections. That is your problem: 100 connections with only 20 private proxies will get them banned.

    For 20 private proxies, use 2-3 connections.

    This is still faster than having tons of public proxies and using 100 connections. Why? Public proxies are unreliable, and a lot of time will be spent skipping/removing dead proxies. Some public proxies are even transparent and show your REAL IP in some way.


  • AlexR Cape Town
    @Tank - 100% correct! They say your connections should be 10% of your proxies.

    100 proxies = 10 connections.
    20 proxies = 2 connections. 
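
    As a quick sketch of that arithmetic (my own illustration, nothing built into SB or SER; the function name is made up):

        # Rule of thumb from this thread: connections ~= 10% of your private proxies.
        def max_connections(num_private_proxies: int, ratio: float = 0.10) -> int:
            return max(1, round(num_private_proxies * ratio))

        print(max_connections(100))  # 10 connections
        print(max_connections(20))   # 2 connections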

  • edited November 2012
    Does that really make sense? That means at any given moment you are only using 10% of your proxies. I always had the idea it was the other way around: 10 proxies, 100 connections (for posting that works out fine).

    I'll just have to try it and compare whether using 2 connections with 20 private proxies beats 100 connections with a shared proxy list (not talking public here).

    So regarding SER, what is the magical % of private proxies vs. threads? Surely it can't be the same 10%, otherwise I would only run 2 threads :).
  • edited November 2012
    For POSTING, using 10 private proxies and 100 connections is fine because you are not constantly posting to those same sites over and over.

    IF we are talking about PageRank checking or searching (with the Google search engine checked), then you have to lower connections or Google will ban your proxies.

    For anything relying on Google, connections should be 10% of your proxies.
  • AlexR Cape Town
    No of threads = 10% x number of proxies. That's your magic number! :-)
  • Don't you think that is really too conservative for SER? I would have to use 2 threads with that value, and someone who uses 100 threads should be using 1,000 private proxies. I've read about guys here going up to 300 threads (wonder what that would cost). The point is that SER is not like SB:
    • Each search thread has a configurable time delay between searches.
    • Not all threads are searching and doing PR checks.


  • I think he was referring to Scrapebox with that 10% rule. For SER it doesn't make sense.
  • AlexR Cape Town
    I see what you mean... for SB it's 10% of proxies = threads. For SER it depends on the timeout and the number of SEs selected.
  • @pisco - I don't use the Automator at the moment (I will in the near future), and I do check my URL list once the PR check is complete. I get a nice list every single time, with a mix of PR 9 all the way down to 0 and N/A. My results have been consistent for a very long time.
     That screenshot doesn't show or prove anything. It just means that the site has no PR. In some cases it could mean that the PR was just not able to be detected at that moment in time, but that's just theory.

    Even when scraping Google, I use around 40-60 connections with 20 privates. No problem. In my experience, and that of many others I speak with who do 10X the amount of scraping/PR checking, the only problem you can face is complaints from the actual proxy provider for using the proxies too much, too frequently (if they're shared, that is).

    @Tank and GlobalGoogler - Never ever heard of that 10% rule (and I do a lot of reading). Waaaay too conservative for my taste as well. I have never heard of anyone being that conservative with their paid private proxies. As long as you're not being ridiculous with them and letting them go at 100+ threads with, say, 10 proxies, then I think we'll all be doing just fine.

    ANYWAYS..... the proxy discussion got my thread a little "sidetracked". ;) Anyone else have feedback as to what pros or cons scraping and importing your own list gives OVER simply letting SER do its thang!!
  • AlexR Cape Town
    @grafx77 - it comes from Loopline on BHW and a few other people who write about SB. I also heard it from a few others at BHW. I find that if I keep it at 10% for threads as well as PR checks, then everything works nicely... no issues. When I increase it to 15%, things start giving trouble and it's a hassle to resolve.

    Agreed - would love to hear some feedback on your initial discussion...been wondering myself. I suppose you can run SB as well as GSA on a VPS and that way get more results. This way 2 programs scrape, not just one. 
  • "That screenshot doesn't show or prove anything. That just means that the site has no PR. In some cases it could mean that the PR was just not able to be detected at that moment in time, but that's just theory."

    Guess you just don't want to see it then. No PR is "N/A", not "---"; when the status says "Error - Dead Proxies" it basically means they were temp banned, which proves my point.

    Either way, a good test to gauge any difference in speed/performance would be to run SER with a set of keywords (letting it scrape for itself) for a day and check submissions/verifications. Then do the same test (clearing history and cache), but scrape targets for the same keywords with SB and import them, letting it run for a day as well; then compare the results.

  • @GlobalGoogler - Yeah.... I frequent BHW (not as much as I used to, due to being busy with my own business) and have always read about everyone using their private proxies more extensively on threads. There will always be a few who follow the 10% rule you state, but I am willing to bet the overwhelming majority do not.

    @Pisco - My bad, buddy! I only skimmed the top of your screenshot and didn't take notice of the bottom part where it states "Bad Proxies". I have never gotten that message when doing my checks before, but I'm happy to see SB at least reports it.

    Back to topic.....SB Imports VS SER
  • If you are serious about scraping, I'd suggest you give hrefer a try (it's pricey, but better than Scrapebox by far). It's just more convenient, because you can use more than 1,000 footprints at once, mod your own search engines, and get new proxies on the fly (either you create your own proxy engine.php or you buy a cheap public proxy service that supports hrefer).

    We scrape 2+ million links per day (uniques) across all platforms and footprints and then identify and sort them into our link database with GSA. And you don't have to worry about your proxies being banned. Our keyword list is roughly 500k words, and it takes ages to run through it with a lot of footprints and SEs, even with 400 threads ;)

    Having said that, we also have a few projects where we let GSA scrape the target sites. But for the majority of projects we import our own link lists.
  • I've used Hrefer before and didn't see anything that made it distinct from SB. I can use as many footprints as I want with SB and scrape as many proxies as I need as well (millions if need be). You're still having to sift through and sort all the links before importing them into SER (time consuming), checking to see if all the engines match up with the SER database, and then importing them in.

    You haven't outlined why you scrape with Hrefer rather than just allowing SER to do it for you. You're basically commenting on how Hrefer is better at scraping than SB and that you import a ton of URLs. I hope you realize that doesn't really address the original question.
  • AlexR Cape Town
    I'd also be very interested to get an answer to this question. 
  • edited November 2012
    Compared to SB:
    It's all about automation and quantity, really. Hrefer can run for days without having to change anything; everything runs automatically: new proxies every 40 minutes, de-duping on the fly, plus you have A LOT more search engines at hand, especially if you know how to add new ones to the engine. Identification is done automatically as the raw scrape runs, so that's not the problem.

    Compared to GSA:
    Honestly, we haven't played that much with the GSA scraper (Options -> Advanced -> Find Online URLs), because we run a lot of projects over our GSA licenses and currently have no intent and no dedicated line to test it properly. We may try it in the future, as it's basically the same as hrefer in terms of functionality, but we haven't tested it enough for a rational comparison, especially regarding speed, stability, and automation. Further, our public proxy source is made for hrefer (engine.php); we know how to use it properly and it runs fast and fully automated (days without changing anything).

    But the GSA scraper will be tested thoroughly. Though at first glance, I can't seem to find a way to add a set of pre-defined keywords to the scraper.

    Also I'm not talking about scraping targets within a project, but externally.
  • SB has a new Automator plugin feature that enables it to run many tasks on a schedule, including scraping new proxies, posting comments, checking backlinks, checking PR, etc. However, it doesn't scrape more than the TOP 3 SEs (Google, Yahoo, Bing), but I'm not too concerned with this.
    I guess it comes down to what program you're more comfortable with. I've been using SB for years, so I see no need to go elsewhere, as the advantages are nonexistent or very minuscule.

    2nd point - I wasn't really talking about comparing SB with GSA's scraper, but rather comparing importing scraped URLs into SER versus typing in your own keywords and just letting SER run its course. THANK YOU though for your feedback. It is greatly appreciated.
  • ron SERLists.com

    Yeah, this thread got sidetracked, but in a good way :). Some great information on SB.

    @grafx77, getting back to your original question, I think it has to do with efficiency. GSA can run at full throttle with all threads on a list that was fed to it, whereas the GSA scrape is constrained by the timing parameters for the scrape and the timing of the search engines' responses.

    Personally, if your needs are getting to the point where you need 50,000 or 100,000 or 1,000,000 links a day, then you will need that other heavy artillery, for things like ranking fast on hard keywords, etc.

    But for what most of us do, I think GSA works pretty darn good.

     

  • @ron - good point! The only value I see in scraping your own list is taking the load off of SER and opening more threads for it to post with, BUT on the other hand, that requires more "hands-on" work, and most of us want full automation.
    Obviously the timing parameters can be adjusted, and the same parameters apply to SB and other scrapers, so that point is invalid.

    I can see that using scrapers in conjunction with SER for MASSIVE campaigns, to help out with the load, would be very beneficial. I think this may be the best answer we've gotten so far on the topic. Thank you.