
Fast Niches - 'Churn and Burn' Techniques?

24 Comments

  • goonergooner SERLists.com
    @coneh34d - I'd be interested to know how you get on with the foreign keywords. I bought the lists but haven't used them yet.

    Automation is not really a problem for me; I have 1 server building a verified list and feeding 3 others automatically. So that's OK - I'm just having performance issues now: out of memory on the dedi, lag on the VPS, etc.

    As I solve one problem, another appears!
  • @gooner Windows 2008 or Windows 2012? The languages are a big bonus and you should definitely use them to build up your verified list.
  • goonergooner SERLists.com
    @coneh34d - I'll give the languages a go, cheers.

    2008 on the dedi.
    2012 on the VPSes - I should get that changed, I know.


  • @gooner I had a look at the series for Scrapebox, and it appears I have already used the techniques mentioned.

    How many proxies are you using, and what kind, to scrape in Scrapebox?
  • @gooner What are your techniques with SB? I mean, how many threads for each search engine - like 25 for Google, 25 for Bing and so on? And what harvester timeout do you set based on that?

    I had some 35 proxies working, mistakenly set 75 threads for Google alone with a 45-second timeout, and it burnt all of them, lol - none work now.
  • goonergooner SERLists.com
    Hi @pratik,

    I have 100 proxies with 25 threads for each SE, and a 90-second timeout.

    I run 1 scrape every day, and recently I see that I can only get 300k URLs from Google, a few more from Bing, but Yahoo always gives me 2 million or more. So it works out OK.

    If I leave SB alone for 2 days I can get a million or 2 from each SE. Hope it helps.
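For anyone wondering how those settings relate to proxy load, here is a minimal back-of-envelope sketch comparing the two setups mentioned above (the helper function is hypothetical, not anything from Scrapebox):

```python
# Hypothetical back-of-envelope helper, not Scrapebox internals: estimate how
# many concurrent search queries each proxy carries on average.

def queries_per_proxy(proxies: int, threads_per_engine: int, engines: int) -> float:
    return (threads_per_engine * engines) / proxies

# 100 proxies, 25 threads each for Google/Bing/Yahoo -> 0.75 queries per proxy
print(queries_per_proxy(proxies=100, threads_per_engine=25, engines=3))

# 35 proxies, 75 threads on Google alone -> ~2.1 queries per proxy,
# roughly triple the load per proxy, which is one way to burn them out.
print(queries_per_proxy(proxies=35, threads_per_engine=75, engines=1))
```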
  • edited January 2014
    Also, @gooner, do you sort/identify in SER or just import target URLs right away after removing dups? Do you still get good performance if you don't, given there might be many useless URLs?

    Also, do you see your proxies blacklisted in the blacklisted column while scraping, even at those minimal settings? If yes, do they usually work fine again maybe the next day or so, or do they stay blacklisted? I find Bing blacklists them more than Google does in SB.

    Edit: Saw your reply above now. Thanks for answering the previous question!

    Cheers.
  • goonergooner SERLists.com
    @pratik - No problem.

    I see some are blacklisted in that column but I don't think it's accurate. Google shows few blacklisted but scrapes the fewest URLs, and Yahoo has many blacklisted but scrapes the most (Bing has many blacklisted too). Doesn't make sense to me.

    All seem to be OK after 24 hours, but I am not using them for scraping in SER, of course.

    At the moment I have a VPS just for processing scraped lists via direct import, and then my other servers use the verified lists from that first VPS. I get about 10k new unique domains per day; it's plenty for me.

    But I am still new to scraping; I can improve it for sure.
  • @gooner So you use the sort/identify feature of SER on the VPS you say you allotted to processing the list?

    And oh, one more thing: what value do you use for results per page? I use 25 to make sure no useless sites are included and it still captures the maximum number of sites SER can post to. Going too high usually (I think) results in lots of useless URLs and non-relevant platforms.

    From how I see it, you seem to have some custom software or setup which processes them and sends the list to the servers where SER uses it to blast them? Cool.
  • goonergooner SERLists.com
    @pratik - I import directly into projects. Sort and identify takes too long.

    Results per page is default, whatever that is. I didn't even know that setting existed until you mentioned it :)

    I use Dropbox to share the verified lists - very quick and easy. There are a few things to note if you use that method; I think @satans_apprentice made a post with all the details recently.
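For readers wanting to copy the setup, here is a rough sketch of one way the Dropbox hand-off could be wired up (paths are illustrative placeholders, and this is not necessarily @gooner's exact setup):

```python
# Illustrative sketch only: copy SER's verified sitelist files into a
# Dropbox-synced folder; the projects on the other servers then read their
# verified list from that shared folder. Both paths are placeholders.
import shutil
from pathlib import Path

VERIFIED_DIR = Path(r"C:\path\to\GSA Search Engine Ranker\site_list-verify")  # placeholder
DROPBOX_DIR = Path(r"C:\Users\you\Dropbox\verified_lists")                    # placeholder

DROPBOX_DIR.mkdir(parents=True, exist_ok=True)
for sitelist in VERIFIED_DIR.glob("*.txt"):
    shutil.copy2(sitelist, DROPBOX_DIR / sitelist.name)  # overwrite with the newest copy
```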
  • @gooner Cool. I directly import too - it does indeed take too long.

    Dropbox - yes, I use it just for uploading backups. Will definitely look into it in future.

    Thanks once again!
  • goonergooner SERLists.com
    @pratik - No probs mate.
  • Tim89 www.expressindexer.solutions
    @gooner & @Pratik - how can "import and sort in" take too long? Stop your projects, ramp your threads up to 1,000 +/-, then import them...
  • goonergooner SERLists.com
    I import over a million URLs per day; import and sort in takes way too long.
    It's easier just to let SER see if it can post to them, and there's no need to stop projects.
  • How often do you 'clean' your lists, @gooner? I have stopped import/identify and now import directly - thanks, it works so much faster.
  • goonergooner SERLists.com
    @judderman - I do that about once a month, but I've noticed it works much better if you check the option "disable proxies" when you run it. The first time I ran it with proxies it deleted half my list - gutted!


  • @gooner What exactly does clean lists do? Removes non-working sites from the list, I assume?
  • goonergooner SERLists.com
    @pratik - Yes exactly that.
  • goonergooner SERLists.com
    Here's a question for you guys... If you delete duplicate domains, that should also delete all duplicate URLs, right? Because if you only have each domain 1x in the list, it's not possible to have any dup URLs.

    So how come, if I delete dup URLs after dup domains, it still finds sites to delete?
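To make the question concrete, here is a minimal sketch of the two dedup passes (plain Python, not SER's code). After a domain-level pass keeps only one URL per domain, a URL-level pass should find nothing more to remove - unless the two passes normalise domains differently (www, case, trailing slash), which is one possible explanation for the behaviour described above.

```python
from urllib.parse import urlparse

urls = [
    "http://example.com/page1",
    "http://example.com/page2",      # dropped: same domain as the line above
    "http://www.example.com/page1",  # kept here because "www." makes it a different netloc
    "http://other.net/a",
]

def dedup_by_domain(urls):
    seen, kept = set(), []
    for u in urls:
        domain = urlparse(u).netloc.lower()
        if domain not in seen:
            seen.add(domain)
            kept.append(u)
    return kept

def dedup_by_url(urls):
    seen, kept = set(), []
    for u in urls:
        if u not in seen:
            seen.add(u)
            kept.append(u)
    return kept

after_domains = dedup_by_domain(urls)
print(after_domains)                 # one URL per unique domain
print(dedup_by_url(after_domains))   # removes nothing further in this sketch
```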
  • That's weird indeed, @gooner. I tried dup domain once but reverted back as I got scared (lol) and wanted to see more people give it a try to see how it works.

    @Tim89 Thanks. Never tried upping threads to 1K but will definitely give it a try.
  • goonergooner SERLists.com
    @pratik - Yeah, I know what you mean; I'm not confident in some of the cleanup features either.
  • edited January 2014
    Here is a helpful video that covers cleaning up your site list:

    [embedded video]
  • Thanks @gooner, I never click the disable proxies bit... damn.
  • Tim89 www.expressindexer.solutions
    @gooner Well, if you're importing 1 million fresh targets per day directly into a project (this makes no sense), every single day, there's no way on this planet your one project is processing all of those targets... you must have a massive backlog.

    You can easily import and sort your scraped lists within a couple of hours tops... then these targets will get hit by all of your projects as they become available in your identified list, meaning you will be attacking all of these identified targets with fewer resources being used.

    You can probably import and sort a million URLs within a couple of hours, possibly increasing your threads to 5,000; all SER is doing is checking each URL to see if it finds a matching footprint, then storing it in the corresponding identified sitelist...

    Importing these raw scraped lists directly into a project uses resources as it attempts to create accounts and post to them, which is also dependent on proxies and connection... using my method eliminates this.
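As a rough illustration of what "identify and sort in" amounts to, here is a sketch under assumptions, not SER's actual implementation (whether SER downloads each page or just pattern-matches the URL is debated further down the thread; this sketch does a simple fetch-and-match). Each URL is checked once against per-platform footprints and filed into the matching identified list, with no account creation or posting, which is why it is much lighter per URL than a full posting attempt.

```python
# Illustrative sketch only (hypothetical filenames and footprints, not SER's code).
import urllib.request

FOOTPRINTS = {                       # toy footprints; real platform detection is far more involved
    "wordpress": "wp-content",
    "drupal": 'content="drupal',
    "phpbb": "powered by phpbb",
}

def identify(url: str):
    """Return a platform name if a footprint is found in the page, else None."""
    try:
        html = urllib.request.urlopen(url, timeout=15).read().decode("utf-8", "ignore").lower()
    except Exception:
        return None
    for platform, footprint in FOOTPRINTS.items():
        if footprint in html:
            return platform
    return None

with open("scraped_urls.txt") as scraped:            # hypothetical input file
    for line in scraped:
        platform = identify(line.strip())
        if platform:
            with open(f"identified_{platform}.txt", "a") as sitelist:
                sitelist.write(line.strip() + "\n")
```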

  • goonergooner SERLists.com
    It's not just 1 project; it's a whole VPS full of projects dedicated to processing scraped lists.

    I hear what you're saying, but I've tested both, and import and sort takes at least 3 times longer to process the same number of URLs.

  • Tim89 www.expressindexer.solutions
    A "whole VPS" isn't much power in regards to processing millions of URLs per day (even if you have lots of projects, at the end of the day, SER can only do so much with the hardware and threads you set things at), the last time I checked, a VPS isn't quite so much "dedicated" either, they are meerly replicated virtual private platforms shared amongst many individuals, in essence, a new windows user with dedicated portions of ram/hdd space etc etc.

    I'm coming from a perspective of having 4 dedicated machines (yes, actual machines with 16/32gb of ram each) running close to 1000 threads 24/7.

    I'm not saying what you're doing is the wrong way of doing it, I'm saying it's difficult to change something you do that you believe is worth while, I'm giving everyone a much much much more solid solution that they can work with which is much less resourceful and personally, I suggest you try it this way too as it could only benefit you.

    I started out doing what you are doing when I purchased GSA SER, scraping and importing into a project that is set not to scrape search engines etc, over grew that and found a more logical method which was staring at me in the face, hence the option "Import URLS - Indentify platform and sort in".

    I'm not sure if users know, but all you need to do is "Stop" your projects, by hitting the big red stop button, then go to "Options" then set your threads to what ever your MACHINE can handle, not your connection speed, but your machine, if you have a beefy machine, you can go all the way up to 10,000 threads, it doesn't matter, then "Advanced -> Tools -> Import URLs - Identify Platform and sort in" and see how fast GSA sorts out your scraped list, disgarding all unpostable sources, it takes me minutes to process tens of thousands, the last list I imported had around 40,000 sources, which isn't that many, but I processed these in around 10 minutes, if that.

    each to their own I guess
  • AlexR Cape Town
    @Tim89 - "I'm not sure if users know, but all you need to do is "Stop" your projects by hitting the big red stop button, then go to "Options" and set your threads to whatever your MACHINE can handle - not your connection speed, but your machine. If you have a beefy machine, you can go all the way up to 10,000 threads."

    Thanks! Great tip!
  • @tim89 I always find that I need to *lower* threads to import, as I max out the CPU usage and SER stops responding - e.g. I'll post at 800 threads but sometimes need to drop to 400-500 in order to keep the machine running smoothly (I am fairly sure that SER not responding results in many URLs timing out rather than being correctly identified).
    I'm using a dedicated server with a Xeon E3 - perhaps not top of the range, but still a solid multithreading CPU. I'm curious as to how you manage to import at such a rate. What does your box run on? (I know you built it yourself..)
  • Tim89 www.expressindexer.solutions
    edited January 2014
    @namdas I'm not so sure there is any loading of these URLs over the web to identify their platform; this is why I increase my threads to such high limits - it isn't affecting my proxies or connection.

    Yes, doing this will increase CPU load for sure. Increase your HTML timeout if you want to run SER at higher threads when posting.

    Unfortunately, your machine is only capable of what it can do, hardware wise. If your CPU maxes out at 500 threads, then that is that. Saying that, it is also a piece of software and very RAM dependent, so it's not entirely CPU related.

    How much RAM do you have in that machine?

    This is roughly the spec of all my machines:

    i7 3770K @ 4.0GHz
    32GB RAM

    I overclock all my computers just a little; some clock speeds are sitting at 4GHz, some at 4.5GHz.


  • It's an E3-1245 v2 @ 3.4GHz with 32GB RAM running 2008 R2 - but I have never seen SER use more than 2GB, sadly.
    I have a feeling the CPU load partly depends on the size of the pages too - e.g. 1,500 threads on a list of contextual articles will be fine, while 1,500 threads on trackback pages with 2,000 OBL will probably clog things up a fair bit.
