Hey, glad you're finding the list so exciting!

For your first question - yes, that's the next step after scraping. GSA will identify the URLs and sort them into their respective platforms. Make sure you clean them up first as described in my guide before importing - that will save you a lot of time and keep duplicates from hogging your resources. The reason you don't want to truncate to the root is that some platforms (wikis, comments, forums) have their submission/registration pages on a subdomain, meaning the homepage is entirely text with no way of inputting anything whatsoever. If a URL turned up in your scrape, it's best to leave it as it is. Do remove the duplicate domains though - that's going to save you a TON of time at the expense of a few missed opportunities.
You can also break the list up into smaller segments and scrape them one at a time. That's what I like to do. It lowers the risk of crashing or hanging your system and is usually much more time efficient. I hope that answers your questions.
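In case it helps to see that cleanup step as code, here's a minimal Python sketch of the idea: keep one URL per hostname but leave the path untouched, so signup and submission pages don't get trimmed away. The file names are just examples, and Scrapebox/GSA already have equivalent built-in options - this is only meant to illustrate the logic.

```python
# Minimal sketch: drop duplicate domains from a harvested list while keeping
# each surviving URL exactly as scraped (no trimming to the root page).
from urllib.parse import urlparse

def dedupe_by_domain(in_path: str, out_path: str) -> None:
    seen = set()
    with open(in_path, encoding="utf-8", errors="ignore") as src, \
         open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            url = line.strip()
            if not url:
                continue
            host = urlparse(url).netloc.lower()
            # Keep only the first URL seen for each hostname; the full path
            # stays intact so registration/submit pages aren't lost.
            if host and host not in seen:
                seen.add(host)
                dst.write(url + "\n")

# Example file names - substitute your own scrape output.
dedupe_by_domain("harvested_urls.txt", "unique_domains.txt")
```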
Can you tell me what a "reasonable" number of keywords to scrape might be with 20 semi-dedicated proxies?
I'm freaking out that 26,456 keywords in one scrape (a small list, but with all the footprints) is going to cause issues.
My results are set to 100 and I'm only using Google. I'm just not sure what a good starting point would be to get some good results without it taking two days or getting myself into trouble with all the searching.
@FuryKyle How extensive are your platform footprint lists? Are they much larger than the default footprints provided in SER?
If, for example, I scrape with your Drupal footprints and your keywords, will at least 5-10% of the results be actual Drupal sites I can post to rather than miscellaneous stuff?
I'm mainly looking to scrape lots of contextual targets, so will this list work well for that?
Using imwithmikey's guide, I've been gathering footprints from Footprint Factory for each engine, creating lots of footprint permutations and scraping with them alone (without keywords). But this approach doesn't work as well as I thought - only a very small percentage of the sites I scrape this way are actually on the platform I'm scraping for.
I highly suggest getting a dedicated VPS if you want to scrape all day. Just start the scrape, come back and check on it after a day or two depending on the size of your list, then rinse and repeat. You're going to get a ton of sites this way. I usually set results to 1,000, but most searches don't yield anywhere near that many anyway, especially with quotes on.
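To put the query volume into perspective, here's a quick back-of-the-envelope calculation. The keyword and proxy counts come from the question above; the footprint count and the per-proxy delay are assumptions picked purely for illustration, so treat the output as a ballpark, not a benchmark.

```python
# Rough estimate of scrape size and duration. The footprint count and the
# per-proxy query delay are assumed values, not measured ones.
keywords = 26_456        # from the question above
footprints = 50          # assumption: size of the footprint set being combined
queries = keywords * footprints

proxies = 20             # semi-dedicated proxies from the question
delay_sec = 20           # assumption: safe delay between queries per proxy

queries_per_hour = proxies * (3600 / delay_sec)
hours = queries / queries_per_hour

print(f"{queries:,} queries, roughly {hours:,.0f} hours at this pace")
# With these assumptions: 1,322,800 queries and ~367 hours, which is exactly
# why splitting the keyword list into smaller batches makes sense.
```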
@Monie I get a pretty high success rate when importing my harvested lists into GSA. You can take a look a few pages back where I posted images of me successfully identifying new sites. If you want more accurate scrapes, you're going to have to be very specific with your list - use more phrases in your permutations and wrap them in double quotes. Personally, I just load it all up and let it rip. I don't have to fuss around too much this way, and I get the added benefit of picking up other platforms as well.
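For anyone unsure what "phrases with double quotes" looks like in practice, here's a tiny sketch that builds exact-match queries from footprints and keywords. The two lists are placeholders - swap in your own footprint and keyword files - and most scrapers can do this combination for you internally.

```python
# Tiny sketch: combine quoted (exact-match) footprints with keywords into
# search queries. Both lists below are placeholders for your own data.
footprints = [
    '"Powered by Drupal"',
    '"Add new comment"',
]
keywords = ["gardening", "fitness", "home improvement"]

queries = [f"{fp} {kw}" for fp in footprints for kw in keywords]

with open("queries.txt", "w", encoding="utf-8") as out:
    out.write("\n".join(queries) + "\n")
# e.g.  "Powered by Drupal" gardening  -> one query per footprint/keyword pair
```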
Hi, I've purchased this list and also purchased FFP. Here is what I'm doing:
I'm creating footprints from my verified URLs - I try to use settings that give me about 2k footprints.
The keyword list is damn large, about 330k.
I load it all into GScraper with the built-in proxies.
I let it run for about 24 hours - it will create a text file of at least 12GB. Yep, massive.
GScraper will run at a peak of 1,500 threads and 150k URLs per minute.
I use SB to split the file.
Then I run a delete-duplicates pass and create 5 cleaning projects in GSA.
I got about 300 verified - article directories from Drupal mainly, as that was my footprint, but I also managed to pick up others like BuddyPress and Catalyst.
I'm gathering footprints per engine.
Can anyone help me or advise on how I can make my process cleaner? This is my first attempt at creating my own lists and I'm pretty new to GSA, although I've learnt and picked up quite a lot in a short space of time.
A purchased list is giving me 500 verified contextual links - unless I'm doing something wrong?
And just cleaning half of my own list after a 24-hour scrape has given me 300 verified contextual links.
I also have Hrefer - if I use Hrefer with ProxyRack, will I get a much faster scrape?
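Side note on the file-splitting step: if Scrapebox ever chokes on a dump that size, here's a rough Python sketch that splits a huge scrape file into fixed-size chunks line by line, so nothing close to 12GB has to be loaded into memory. The chunk size and file name are arbitrary examples.

```python
# Rough sketch: split a huge scrape dump into fixed-size chunks, reading it
# line by line so the whole file never sits in memory. Names/sizes are examples.
def split_file(path: str, lines_per_chunk: int = 1_000_000) -> None:
    chunk_no = 0
    out = None
    with open(path, encoding="utf-8", errors="ignore") as src:
        for i, line in enumerate(src):
            if i % lines_per_chunk == 0:
                if out:
                    out.close()
                chunk_no += 1
                out = open(f"{path}.part{chunk_no:03d}.txt", "w", encoding="utf-8")
            out.write(line)
    if out:
        out.close()

split_file("gscraper_dump.txt")
```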
1. Do you dedupe URLs and domains?
2. Do you trim to the last path and then to the domain? Or do you just remove duplicate URLs and then let GSA figure it out?
I just ran a scrape using a 500-keyword list looking for phpFox sites and got back 12,863 URLs. Do I remove the duplicate URLs now? Then the domains, or just the URLs, and then import it as targets for the project and let GSA do the job?
I've seen the topic go both ways. Most would agree that a duplicate URL is no good, but if you remove the duplicate domains, you could also remove a URL on a domain that might have worked.
Thoughts?
@Yusuf_h Deduping domains automatically dedupes the URLs as well. 300 verifieds from a 12GB scrape is unusual - are you combining the right footprints? Make sure you aren't using proxies and that your thread count is tuned properly so GSA doesn't start timing out on connections.
Sorry, a correction - it was a 12GB file and I only used maybe half of it. At the time of posting I had about 300 verified, but I ended up with about 600 contextual verifieds from 6GB. I only used half the file to test.
As for footprints, I used whatever footprints FFP found for me, and the frequency I used was 20 or 30%. I also combined 2+2 in the settings, as 2-3 was throwing back hundreds of thousands of combinations (there's a rough illustration of why below).
In GSA I have 50 dedicated proxies and I use 100 threads. I don't get timeout errors, and I kept the HTML timeout at 180.
You said don't use proxies? I don't understand - don't use proxies for what?
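On those combination counts: I don't know exactly how FFP builds its permutations, but as a generic illustration of why allowing 3-phrase combinations explodes compared to pairs only, here's the arithmetic with an assumed pool of 100 candidate phrases.

```python
# Generic illustration (not FFP's exact logic) of how combination counts grow
# once 3-phrase combinations are allowed on top of pairs.
from math import comb

phrases = 100                     # assumption: number of candidate phrases
pairs = comb(phrases, 2)          # 2-phrase combinations
triples = comb(phrases, 3)        # 3-phrase combinations

print(f"pairs only:      {pairs:,}")             # 4,950
print(f"pairs + triples: {pairs + triples:,}")   # 166,650
```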
@Yusuf_h What I meant to say was to disable your proxies when you are verifying/identifying sites, to speed things up. I know some clients who accidentally leave that option checked, and once their proxies start timing out, the identification numbers go south.
Half a million identified sites, wonderful! With a few more scrapes I'm sure you'll surpass a million unique sites.
I'm having trouble importing the footprints into Scrapebox. It seems like many of the footprints contain several different types of unicode angled quotation marks, and when they are imported into Scrapebox they get scrambled.
Here are the quotation marks:
http://prntscr.com/4tekty
Yellow - normal ANSI quotes
Red - angled unicode quotes
And here is how they look in Scrapebox when imported:
http://prntscr.com/4temgz
I searched this whole thread and it seems like nobody else has this issue, or nobody has noticed it at all. I tried using Scrapebox's unicode converter, but it converts even the non-unicode characters and it becomes an even bigger mess. Can anyone tell me why this is happening and how to fix it?
That's weird. Are you talking about the footprints in the footprint list? If that is the case, you can try manually swapping in normal quotes yourself with some regex.
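If anyone wants to do that cleanup outside of Scrapebox, here's a small Python sketch that replaces the common curly/angled unicode quotes with plain ASCII ones before the footprints get imported. The file names are placeholders, and the mapping only covers the usual suspects, so extend it if your list uses other characters.

```python
# Small sketch: normalize curly/angled unicode quotes in a footprint file to
# plain ASCII quotes before importing into Scrapebox. File names are examples.
QUOTE_MAP = {
    "\u201c": '"', "\u201d": '"',   # " " curly double quotes
    "\u2018": "'", "\u2019": "'",   # ' ' curly single quotes
    "\u00ab": '"', "\u00bb": '"',   # « » angled (guillemet) quotes
    "\u201e": '"', "\u201a": "'",   # „ ‚ low quotes
}
TRANSLATION = str.maketrans(QUOTE_MAP)

with open("footprints.txt", encoding="utf-8", errors="ignore") as src:
    cleaned = src.read().translate(TRANSLATION)

with open("footprints_clean.txt", "w", encoding="utf-8") as dst:
    dst.write(cleaned)
```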
@CheapcaptchaCS At this point, I can only accept PayPal.
No problem.
All orders processed today.
Every time there is an update, you get emailed the new list. With the lifetime package, you are eligible for new lists forever.
PM me your email and transaction ID, and I'll send you the updated list.