GSA Footprints

kijix84 · July 2016

guys many of the footprints for all of the different engines are out of date and no longer return a healthy amount of results

sven is willing to listen to suggestions for adjustments, so for those who are willing, it would be nice to have some help with this task of weeding out the bad footprints.

if you want to help, simply choose a platform (article, blog, social network, wiki etc) and pick some engines from that platform. open up options -> advanced -> search online for URLs -> Add pre-defined footprints -> choose a platform -> choose an engine

then just put the footprints it populates into scrapebox and run the google competition finder on them

imo anything that gives less than 5k results is not at all worth keeping and just slows SER down.

actually since many people will be using said footprints i would say anything under 20k even is not worth it.
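A minimal sketch of the cut-off idea, assuming the result counts have already been exported as footprint/count pairs. The footprint strings, the counts, and the `split_footprints` helper are hypothetical and only illustrate the filtering step. The threshold is a parameter, since where to draw the line is a judgment call.

```python
# Hypothetical sketch: split footprints into "keep" and "drop" lists
# based on how many search results each one returned.
# The footprints and counts below are made-up illustration data.

def split_footprints(counts, min_results=20_000):
    """Return (keep, drop) lists of footprints for a given threshold."""
    keep = [fp for fp, n in counts.items() if n >= min_results]
    drop = [fp for fp, n in counts.items() if n < min_results]
    return keep, drop


example_counts = {
    '"powered by exampleengine"': 250_000,
    'inurl:"/example-register.php"': 4_200,
    '"example footprint with no hits"': 0,
}

keep, drop = split_footprints(example_counts, min_results=20_000)
print("keep:", keep)
print("drop:", drop)
```

Lowering `min_results` to 5000 reproduces the more lenient cut-off.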

many of you may use lists or scrape outside SER, but lately i have been using GSA to do my scraping and it can work quite well. many of these footprints have to go though, and it is a rather big task for one person to do alone

if you are willing to help please post here or PM me and send me your results and i will get everything together to send to sven.

shaun · July 2016

Good idea, I pre filtered them a while back but only kept the platforms I use for both contextual and non contextual that had results over 50,000 or it might be 80,000 not 100% sure to be honest. They are literally just text files of all the footprints though no data about their platform or anything.

If you think it will help you out let me know and I will paste them here.

Johan · July 2016

Please don't do this.

How do you judge a 'healthy number of results' ?

"imo anything that gives less than 5k results is not at all worth keeping and just slows SER down."

Exactly. It's just your opinion. It doesn't mean Sven should make changes that will affect every user. Those 5k sites might contain gold dust. How do you know you're not sacrificing quality for speed?

I'd prefer to make those decisions myself. The only question that ought to be considered is whether the footprint itself continues to work.

kijix84 · July 2016

@shaun, yes or you could send them in PM. i'm putting together a list.

@Johan, exactly how many links do you think you are going to be able to build with thousands of GSA SER users all using footprints for under 5k results? do you honestly think you will find quality with so many users using the same footprints? there are footprints that return 0 results or even under 100 still being used by SER right now and they are hard coded into the program.

if i was able to remove these myself it wouldnt be a problem but you cannot.

710fla · July 2016

@shaun if you could post them that would be great. I've been tweaking my own but have never been able to get many verified targets from raw scrapes with Scrapebox

shaun · July 2016

@kijix84

@710fla

Something somewhere has gone tits up and the list has been overwritten by some scraping results. TBH it never took me long to make with the link extractor. If I have time I will try to remake it. It will literally just be a list of the default ones though, the footprint along with its search results. I'm not giving any custom ones away.

If you still think it would be useful kijix84 let me know and I will try get one started asap.

kijix84 · July 2016

@shaun yes I don't need any custom footprints just the default ones and their search result counts

shaun · July 2016

@kijix84 Starting a scan on the 327 footprints SER has for the main contextual platforms. I will post the results when it is done, but I won't have them until late tonight or tomorrow morning when I wake up, as I have my Google proxies working on stuff for live projects and have to balance the load with them.

Johan · July 2016

"if i was able to remove these myself it wouldnt be a problem but you cannot. "

Yes you can. They're in the Footprint Studio. Just select the ones you don't want and hit delete.

I understand the point you're making, I just think it's for individual users to decide what they keep or not.

710fla · July 2016

@shaun since you mentioned Google proxies, what's your proxy to thread ratio when index checking domains in Scrapebox?

I've been scraping and posting to Bing targets and wanted to see how many domains are actually in Google's index.

I tried scraping Google with Google passed proxies but the only real efficient method was to pay monthly for rotating proxies.

I already had GSA Proxy Scraper so decided to start scraping Bing.

When scraping for tier 2 links like blog comments I scrape Frontpage.com since they get their results from the Google API and have a less strict ban limit for proxies.

shaun · July 2016

@710fla mate it takes the fucking piss, it really does. I remember like two years ago a proxy was fine checking something once every ten or so seconds, provided the query did not have enhanced modifiers. Now I find I have to go to around once every 70 seconds per proxy.

So I just try to have my connections and delay set up so that each proxy only connects once every 70 seconds. It does depend on your proxy provider though, a friend has his at once every 40 seconds but his proxies are too expensive in my opinion.

There are other ways to do it but this way lets me do it with resources I already have at hand and gets the job done without any bans.
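To put rough numbers on that pacing, here is a small sketch of the arithmetic, assuming requests are rotated evenly across the proxy pool. The function names and the pool sizes are hypothetical; only the 70 second per-proxy interval comes from the post above.

```python
# Hypothetical sketch: work out the global delay and hourly throughput
# when every proxy may only be used once per fixed interval.
# Assumes requests are spread evenly (round-robin) over the proxy pool.

def seconds_between_requests(num_proxies, per_proxy_interval=70.0):
    """Delay between consecutive requests across the whole pool."""
    if num_proxies <= 0:
        raise ValueError("need at least one proxy")
    return per_proxy_interval / num_proxies


def checks_per_hour(num_proxies, per_proxy_interval=70.0):
    """How many checks the pool can make per hour at that pacing."""
    if num_proxies <= 0:
        raise ValueError("need at least one proxy")
    return num_proxies * 3600.0 / per_proxy_interval


for pool_size in (10, 35, 100):
    delay = seconds_between_requests(pool_size)
    rate = checks_per_hour(pool_size)
    print(f"{pool_size} proxies: one request every {delay:.2f}s, about {rate:.0f} checks/hour")
```

The same functions cover the 40 second case by passing `per_proxy_interval=40.0`.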

kijix84 · July 2016

@johan standard footprints are hard coded. You can't delete them with footprint studio. Sven confirmed this.

710fla · July 2016

@kijix84 just edit the .ini file in your GSA folder

kijix84 · July 2016

@sven can you weigh in here?

710fla · July 2016

@shaun thanks for the input. So 70 second delay for index checking in Scrapebox or for scraping Google?

shaun · July 2016

@kijix84 you can remove them from your .ini as well as add your own custom ones to my knowledge.

@710fla For Google index checking I have it set so each proxy checks one URL per 70 seconds. That lets me run my index checks over a long period of time, so I can check shit tons of URLs without proxies getting banned. For scraping I just scrape Bing now and then Google index check any verifieds I get. Overall I feel it is quicker.

710fla · July 2016

@shaun thank you man really appreciate all your posts.

So I'm assuming you scrape Bing for blog comments, guest books, and image comments too?

kijix84 · July 2016

@shaun well that would be problem solved wouldn't it? what is the ini file called, as i didn't see one related to this. i only saw ini files for engines that i added custom footprints to.

redrays · July 2016

@shaun - good share as usual. Do you have a feel for what % of the verifieds you scrape from Bing end up being indexed in Google?

shaun · July 2016

@710fla no worries, emm it depends. I usually use Bing and footprint searches for contextuals only, and link extraction for blogs, guestbooks, image and trackbacks, although I currently just use a paid list for my non contextuals to test if it is easier. You don't need proxies for link extraction so you can totally hammer a list, and with internal/external extracts your non contextual list grows exponentially. You can also use the method to acquire other people's lists.

All those people paying for SER lists but as soon as you get a few of their blog comment targets you can extract lots of their contextuals.

@kijix84 emm you go into the SER folder, then the Engines folder within that, and then there is a footprints section in the engine text files that you modify, if I remember correctly.

Not sure if it is problem solved, tbh I quickly read your first post and thought I still had my old list so was going to give you it till I realised it had been saved over :P Anyway the Google extract has finished so I will paste its results below this post in case you still need it.

@redrays mate it's all over the place tbh. From a footprint scrape around 50-60% get identified I would imagine. Out of that it's engine dependent what gets verified, but it's a very small percentage. Out of that about half are no follow, so I send them to an inactive back up in case I ever see a reason to use them. Out of what remains a fair chunk are indexed in Google, although over time they do drop out sometimes.

I did have a spreadsheet with a full breakdown a while back but I don't have it anymore, some of it could make you cry lol. For example a Drupal scrape of 1 million URLs before dedupe would result in say 5 or less verified that pass my criteria and move to live projects. When you put it like that I can see why people buy lists.

But with link extraction it's crazy cheap, around 30-40% of the list will be identified by GSA PI into contextuals, blog, image, guestbook and trackback, and out of that a much higher percentage goes to live projects. Thing to remember is you have no control over the platforms you get, and for some reason the vast majority are non contextual. But then you internal extract those, process them and repeat the process.

Sorry for the typos, its late hear and my heads hurting so I cba to fix them :P.

shaun · July 2016

https://pastebin.com/LzfDBDky

kijix84 · July 2016

@shaun thanks for both. i checked and you're right, footprints are in the engine files themselves. this is perfect

710fla · July 2016

@shaun thanks a lot you just saved me so much time. Appreciate you taking the time to help.

I'm editing my footprints tonight and gonna scrape Bing and post to the raw list tomorrow.

kijix84 · July 2016

if only we had a script that would search through all files in a folder for the specified text and delete a string you specify.. that would be the fastest way to do this
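Something along these lines could do it. This is a rough sketch, not a tested tool for SER: it assumes the engine files are plain text files matching `*.ini` in a single folder, and the function name, the `.bak` backup naming and the example path are all hypothetical. It works on raw bytes so line endings and any odd characters in the files are left untouched.

```python
# Hypothetical sketch: remove an exact string from every matching file
# in a folder, keeping a .bak copy of each file that gets changed.
# Assumes plain text engine files matching "*.ini" (an assumption).
from pathlib import Path


def remove_string_from_files(folder, target, pattern="*.ini", encoding="utf-8"):
    """Delete every occurrence of `target`; return names of changed files."""
    if not target:
        raise ValueError("target string must not be empty")
    needle = target.encode(encoding)
    changed = []
    for path in sorted(Path(folder).glob(pattern)):
        if not path.is_file():
            continue
        data = path.read_bytes()
        if needle not in data:
            continue  # nothing to remove in this file
        # Save the original next to the file before overwriting it.
        path.with_name(path.name + ".bak").write_bytes(data)
        path.write_bytes(data.replace(needle, b""))
        changed.append(path.name)
    return changed
```

Usage would look like `remove_string_from_files(r"C:\path\to\Engines", '"some dead footprint"')`, with the path swapped for wherever the engine files actually live. It only strips the exact text given, so any separator around a footprint has to be included in the string if it should go too. Running it on a copy of the folder first is the safer option.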

redrays · July 2016

@shaun - yep, I'm getting into link extraction at @kijix84's helpful suggestion. Glad to hear it's working great for you too.

bencrabara · July 2016

Excuse me for advertising but y'all should check out SEO LIST Builder on BHW. It uses the same link expansion technique but is automated for those with automator, plus it is pretty cheap.
