
۞MOST COMPLETE GSA Keyword Scraping List۞ • 10+ Major Languages • CREATE Your Own GSA Site Lists◄


Comments

  • @mda1125

    Hey, glad you're finding the list so exciting :)

    For your first question - yes, that would be the next step after scraping. GSA will identify the URLs and sort them into their respective platforms. Make sure you clean them up first as stated in my guide before importing, as that will save you a lot of time and prevent duplicates from hogging your resources. The reason you don't want to truncate to the root is that some platforms (wikis, comments, forums) have their submission/registration pages on a subdomain - meaning the homepage is entirely text with no way of inputting anything whatsoever. If the URL turned up in your scrape, it's best to leave it as it is. Do remove the duplicate domains though (rough sketch at the end of this post) - that's going to save you a TON of time at the expense of a few missed opportunities.

    You can break the list up into smaller segments and scrape them one at a time. That's what I like to do. It lowers the risk of crashing or hanging your system and is usually much more time efficient. I hope that answers your questions.
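
    On the domain dedupe - if you want to script that cleanup yourself before importing, here's a rough Python sketch of the idea (the file names are just placeholders, point them at whatever your scraper spits out):

        # dedupe_domains.py - keep only the first URL seen for each domain
        # (placeholder file names; adjust to your own scrape output)
        from urllib.parse import urlparse

        seen = set()
        with open("harvested_urls.txt", encoding="utf-8", errors="ignore") as src, \
             open("deduped_urls.txt", "w", encoding="utf-8") as dst:
            for line in src:
                url = line.strip()
                if not url:
                    continue
                domain = urlparse(url).netloc.lower()
                if domain and domain not in seen:
                    seen.add(domain)
                    dst.write(url + "\n")

    Note it keeps the full URL for the first hit on each domain, so nothing gets truncated to the root.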
  • @FuryKyle

    Can you tell me what a "reasonable" amount of keywords to scrape might be with 20 semi-dedicated proxies?

    I'm freaking out that 26,456 keywords in one scrape (small list, but all the footprints) is going to cause issues.

    My results are set to 100, only using Google. I'm just not sure what a good starting point might be to get some good results without taking two days or getting myself into trouble with all the searching.
  • @FuryKyle How extensive are your platform footprint lists? Are they much larger than the default footprints provided in SER?

    If, for example, I scrape with your Drupal footprints and your keywords, will at least 5-10% of the results be actual Drupal sites I can post to, instead of miscellaneous stuff?

    I'm mainly looking to scrape lots of contextual targets, so will this list work great for that?

    Using imwithmikey's guide, I've been gathering footprints from Footprint Factory for engines, creating lots of footprint permutations and scraping with them only (without keywords). But this approach doesn't work as well as I thought. Only a very small percentage of the sites I'm scraping with this method are of the actual platform I'm scraping for.
  • @mda1125

    I highly suggest getting a dedicated VPS if you want to scrape all day. Just scrape, come back and check after a day or two depending on the size of your list, and then rinse and repeat. You're going to get a ton of sites this way. I usually do 1,000, but most searches don't yield anywhere near that amount anyway, especially with quotes on.

    @Monie
    I get a pretty high success rate when importing my harvested list into GSA. You can take a look a few pages back where I posted images of me successfully identifying new sites. If you want more accurate scrapes, you're going to have to be very specific with your list - probably use more phrases for your permutations, with double quotes. Personally, I just load it all up and let it rip. I don't have to fuss around too much this way, and I get the added benefit of picking up other platforms as well.
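
    Just so we're talking about the same thing - the queries you feed the scraper are nothing more than footprint + quoted keyword combinations. A minimal Python sketch of that step (the file names are made up, not from any tool):

        # build_queries.py - combine platform footprints with quoted keyword phrases
        # (placeholder file names; one footprint / keyword per line)
        with open("footprints.txt", encoding="utf-8") as f:
            footprints = [ln.strip() for ln in f if ln.strip()]
        with open("keywords.txt", encoding="utf-8") as f:
            keywords = [ln.strip() for ln in f if ln.strip()]

        with open("queries.txt", "w", encoding="utf-8") as out:
            for fp in footprints:
                for kw in keywords:
                    out.write(f'{fp} "{kw}"\n')  # double quotes force an exact-phrase search

    The double quotes around the phrase are what tighten the results, at the cost of fewer hits per query.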
  • When you guys use Scrapebox or whatever... and you ultimately have a list of XX URLs..

    1.  Do you dedupe URLs and domains?
    2.  Do you trim to last path and then to domain?

    Or do you just remove duplicate URLs... and then let GSA figure it out?

    I just ran a scrape using a 500-keyword list looking for phpFox sites. Got back 12,863 URLs.

    Now remove duplicate URLs? Then dedupe domains, or just the URLs, and import it into the project's targets and let GSA do the job?
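
    For reference, this is roughly what I mean by the two trims (a quick Python illustration, nothing to do with Scrapebox itself):

        # trim.py - "trim to root domain" vs "trim to last path"
        from urllib.parse import urlparse

        def trim_to_root(url):
            p = urlparse(url)
            return f"{p.scheme}://{p.netloc}/"

        def trim_to_last_path(url):
            p = urlparse(url)
            path = p.path.rsplit("/", 1)[0]        # drop the final path segment
            return f"{p.scheme}://{p.netloc}{path}/"

        url = "http://example.com/forum/members/register.php?ref=1"
        print(trim_to_root(url))       # http://example.com/
        print(trim_to_last_path(url))  # http://example.com/forum/members/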
  • Hi, I've purchased this list and have also purchased FFP. Here is what I'm doing:

    • I'm creating footprints using verified URLs - I try to use settings that give me about 2k footprints.
    • The keyword list is damn large - about 330k.
    • I load it all into GScraper with the built-in proxies.
    • I let it run for about 24 hours - it will create a text file of at least 12GB - yep, massive.
    • GScraper will run at a peak of 1500 threads and 150k URLs per minute.
    • I use SB to split the file (rough sketch of this step at the end of this post).
    • I run a delete-duplicates pass and create 5x cleaning projects in GSA.
    I got about 300 verified - mainly Drupal article directories, as that was my footprint, but I also managed to pick up others like BuddyPress and Catalyst.

    I'm gathering footprints per engine.

    Can anyone help me or advise on how I can make my process cleaner? This is my first attempt at creating my own lists and I'm pretty new to GSA, although I've learnt and picked up quite a lot in a short space of time.

    A purchased list is giving me 500 verified contextual links - unless I'm doing something wrong?

    And just cleaning half of my own list after a 24-hour scrape has given me 300 verified contextual links.

    I also have Hrefer. If I use Hrefer with ProxyRack, will I get a much faster scrape?

    Thank you in advance for all your help and advice.
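
    For reference, here is roughly what the splitting step looks like - I actually do it in SB, this is just a plain Python equivalent so you can see what I mean (file name and chunk size are made up):

        # split_scrape.py - split a huge URL dump into smaller chunks without
        # loading the whole file into memory (placeholder file name / chunk size)
        CHUNK_LINES = 1_000_000

        part, out, written = 1, None, 0
        with open("gscraper_dump.txt", encoding="utf-8", errors="ignore") as src:
            for line in src:
                if out is None or written >= CHUNK_LINES:
                    if out:
                        out.close()
                    out = open(f"chunk_{part:03d}.txt", "w", encoding="utf-8")
                    part, written = part + 1, 0
                out.write(line)
                written += 1
        if out:
            out.close()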
  • Do you dedupe URLs and domains?

    I've seen the topic go both ways. Most would agree that a duplicate URL is no good. But if you remove duplicate domains, you could also be removing a URL on a domain that might have worked.

    Thoughts?
  • At this point I'm only deleting duplicate domains - with just half of the 12GB file I managed to get 300 verifieds.
    I've now scraped for another engine's footprints. I'm going to see how that goes and how many verified contextuals I get out of that scrape.
  • @Yusuf_h
    Deduping domains automatically dedupes the URLs as well. 300 verifieds from a 12GB scrape is unusually low. Are you combining the right footprints? Make sure you don't use proxies and that your number of threads is properly tuned, so GSA doesn't start timing out on connections.
  • Sorry, correction - it was a 12GB file, and I only used maybe half of it. At the time of posting I had about 300 verified, but I ended up with about 600 contextual verifieds from 6GB. I only used half the file to test.

    As for footprints, I used whatever footprints FFP found for me, and the frequency I used was 20 or 30%. Plus, I combined 2+2 in the settings, as 2-3 was throwing back hundreds of thousands of combinations (see the little illustration at the end of this post).

    On GSA I have 50 dedicated proxies and I use 100 threads. I don't get timeout errors. The HTML timeout I kept at 180.

    You said don't use proxies? I don't understand - don't use proxies for what?
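
    Just to show why I capped it at 2+2 - the number of combinations grows very fast once you allow bigger groups. This is only a generic illustration with math.comb, not how FFP counts things internally:

        # combo_count.py - generic illustration of how allowing larger footprint
        # groups explodes the number of combinations (not FFP's exact logic)
        from math import comb

        fragments = 200                    # say you extracted 200 footprint fragments
        pairs = comb(fragments, 2)         # groups of exactly 2
        triples = comb(fragments, 3)       # groups of exactly 3
        print(pairs)                       # 19900
        print(pairs + triples)             # 1333300 - size-3 groups add over a million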
  • @Yusuf_h
    What I meant to say was: disable your proxies when you are verifying/identifying sites, to speed things up. I know some clients who accidentally leave that checked, and once their proxies start timing out, the identification numbers go south.
  • Thanks Kyle!  Your keyword list is awesome.  I only did a tiny fraction of it after splitting it up.

    [screenshot of identified sites]
  • @mda1125
    Half a million identified sites, wonderful! With a few more scrapes I'm sure you'll surpass a million unique sites. 
  • @FuryKyle Finished the 2nd scrape last night and onto the 3rd as I type. I'm already at 1.6M!
  • I'm having trouble importing the footprints into Scrapebox. It seems like many of the footprints include several different types of Unicode angled quotation marks, and when they are imported into Scrapebox they get scrambled.

    Here are the quotation marks:

    http://prntscr.com/4tekty

    yellow - normal ANSI quotes
    red - angled Unicode quotes

    And here is how they look in Scrapebox when imported:

    http://prntscr.com/4temgz

    I searched this whole thread and it seems like no one else has this issue, or nobody has noticed it at all.

    I tried using Scrapebox's Unicode converter, but it converts even the non-Unicode characters and it becomes an even bigger mess.

    Can anyone tell me why this is happening and how to fix it?
  • @spiritfly

    That's weird. Are you talking about the footprints on the footprint list? You can try manually adding the quotes yourself with some regex if that is the case.
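
    Something along these lines should do it before you import (a rough Python sketch; the file names are placeholders):

        # fix_quotes.py - replace angled / curly Unicode quotes with plain ASCII quotes
        # before importing footprints into Scrapebox (placeholder file names)
        import re

        DOUBLE = re.compile("[\u00ab\u00bb\u201c\u201d\u201e\u201f]")  # « » and curly doubles
        SINGLE = re.compile("[\u2039\u203a\u2018\u2019\u201a]")        # ‹ › and curly singles

        with open("footprints.txt", encoding="utf-8", errors="ignore") as src, \
             open("footprints_ascii.txt", "w", encoding="utf-8") as dst:
            for line in src:
                line = DOUBLE.sub('"', line)
                line = SINGLE.sub("'", line)
                dst.write(line)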
  • @deNiro72 It also includes footprints
  • Do you accept other payment methods outside of PayPal?
  • Thank you @Vijayarag for answering for me.

    @CheapcaptchaCS
    At this point, I can only accept PayPal.
  • @FuryKyle I see, thanks for the info.
  • @CheapcaptchaCS
    No problem.

    All orders processed today.
  • xeroxias
    How does the lifetime package work? Do I get new keywords through your website or something?
    Sorry, I'm new to all this!
  • xeroxias
    Every time there is an update, you get emailed the new list. With the lifetime package, you are eligible for new lists forever.
  • @FuryKyle
    Just purchased: XXXX0239576081009
  • All orders processed today.
  • Lists sent out.
  • All orders processed.
  • A new update will be out shortly.
  • I bought the list like 6 months ago. Is there an updated list? How do I get the new list?
  • @rad
    PM me your email and transaction ID, and I'll send you the updated list.