Comments
In case you want to downgrade, here's a link to 8.0:
http://www38.zippyshare.com/v/55616772/file.html
Both are running at 300 threads.
I'll just mess around a bit more and see if I can get anything to change on my own.
I will try with a list as well now to see if the issue is fixed or not
Artsi, may I ask how you build your lists?
I run Gscraper on the Geek plan on SolidSEO.
I exported the footprints from SER, threw in around 400k keywords, and let it roll. I then remove all the duplicates and check the HTTP status, indexing, and PR.
1M scraped URLs turns into 50-100k importable URLs for me right now.
I then just import them directly into projects.
I have this 8.0 version .exe available, so I started SER from that, and the problem persists.
Is the bug perhaps somewhere in the files and not in the software itself? I'm not a developer, just guessing at solutions. Not sure where the bug is. At some point I was tempted to learn programming, but I gave up when I realized it's not for me.
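For anyone who wants to script that filtering step outside of SER, a minimal Python sketch of the dedupe + HTTP-status pass could look like the lines below. The file names and timeout are placeholders, it runs sequentially (a real checker would be threaded), and it is only meant to illustrate the idea, not to replace whatever tool you actually use.

    # Sketch: remove exact duplicate URLs, then keep only URLs that answer with HTTP 200.
    # "scraped_urls.txt" / "filtered_urls.txt" are placeholder file names.
    import urllib.request

    with open("scraped_urls.txt") as f:
        urls = sorted({line.strip() for line in f if line.strip()})  # dedupe exact duplicates

    alive = []
    for url in urls:
        try:
            req = urllib.request.Request(url, method="HEAD",
                                         headers={"User-Agent": "Mozilla/5.0"})
            with urllib.request.urlopen(req, timeout=10) as resp:
                if resp.status == 200:
                    alive.append(url)
        except Exception:
            pass  # DNS errors, timeouts and 4xx/5xx responses are simply dropped

    with open("filtered_urls.txt", "w") as f:
        f.write("\n".join(alive))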
Regarding the way you build the list, I'm doing 99% the same process. The only difference is that I only check PR (on a domain level, not URL). I don't bother with indexing or other settings such as OBL (which I set in SER anyway).
A quick piece of advice: just try the free version of Footprint Factory and watch the first video on the website, then follow the instructions. I did that yesterday and used Gscraper with the footprints from one platform. You will end up with millions of potential targets.
I let Gscraper run for a limited period of time (a few hours), and now I'm filtering 1.2 million URLs pertaining to a single CMS platform (please note those 1.2 million are all unique domains, no duplicates at all).
And that's only from a small run using the FF free version (Pro gives you wild abilities in terms of footprints).
I plan to repeat the process for every CMS (especially the contextual and do-follow ones). This should put SER on steroids...
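If you ever want to do that unique-domain filtering yourself rather than in Gscraper, a rough Python sketch (treating the full hostname as the domain, which is a simplification) could be:

    # Sketch: keep only the first URL seen per host, so the list ends up one entry per domain.
    from urllib.parse import urlparse

    seen = set()
    unique = []
    with open("scraped_urls.txt") as f:          # placeholder input file
        for line in f:
            url = line.strip()
            if not url:
                continue
            host = urlparse(url).netloc.lower()
            if host.startswith("www."):
                host = host[4:]
            if host and host not in seen:
                seen.add(host)
                unique.append(url)

    with open("unique_domains.txt", "w") as f:   # placeholder output file
        f.write("\n".join(unique))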
Hmm... Right now I'm getting a lot of these "The remote name could not be resolved" errors as I check the HTTP status. But when I click on the URL, it's still alive and healthy. I wonder if this is a proxy issue?
Will check out the footprint factory!
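On the "remote name could not be resolved" question: one quick way to tell whether the URL or the proxy is at fault is to fetch the same page once directly and once through a proxy and compare. A small sketch of that test, where the URL and proxy address are just placeholders:

    # Sketch: request the same URL with and without a proxy to see which side is failing.
    import urllib.request

    URL = "http://example.com/"               # put the URL that reported the DNS error here
    PROXY = "http://123.45.67.89:8080"        # placeholder, use one of your own proxies

    def try_fetch(url, proxy=None):
        handlers = [urllib.request.ProxyHandler({"http": proxy, "https": proxy})] if proxy else []
        opener = urllib.request.build_opener(*handlers)
        try:
            with opener.open(url, timeout=10) as resp:
                return "HTTP %s" % resp.status
        except Exception as exc:
            return "failed: %s" % exc

    print("direct:   ", try_fetch(URL))
    print("via proxy:", try_fetch(URL, PROXY))
    # If the direct request works but the proxied one fails, the proxy (or its DNS) is the problem.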
@seo4all, I just watched the FpF video on their website. I don't quite understand what it does, though. Footprints? I mean... are you sure SER can post to those links if it doesn't understand what engines they are part of?
Artsi, just like with normal scraping, you can't know for sure whether SER will post to them or not, but it will definitely post to a certain number of them. I wouldn't be too worried about that. Scraping with Gscraper using the SER footprints is the same thing: you find tons of potential targets, but when you run them you realize you can only post to a part of them, even if in theory SER could post to all of them.
That's the reality of scraping. However, FF provides a way to uncover millions of targets which aren't being used by other GSA users.
Just try following the free video on their site. Download FF free, generate some CMS footprints, and scrape with them in Gscraper. It won't cost you a dime, and at the end you might have many more targets than Gscraper would find on its own with the default footprints.
Hope that helps you.
P.S. One thing I would like to reinforce here: just as the FF video said, select a CMS platform which gave you good results as a starting point.
So, as an example... Here's one of the footprints from Articles -> Wordpress articles:
"Powered by WordPress + Article Directory plugin"
If I go into Google with that, one of the results I find is this:
http://www.addnewarticles.com/health/cosmetic-dentistry-is-not-just-about-beauty.html
and I believe SER could post to that.
So, the FpF then... Did I understand it right that I paste some URLs into it, and it then gives me footprints like the one I pasted above, so that I can find more sites SER will recognize as being part of the Article engine?
I want to get this; I'm thinking this is probably crucial for my understanding.
With FF you'll have to import 25 unique domains in the free version (Pro allows you to upload unlimited domains).
Note that you must only upload unique domains.
After that, check "Process text Snippets" on the left side and click "Get Footprints".
On the "Footprint List Builder Tab" make sure to check "Put snippets in quotation marks" (this is required later on on Gscraper)
Once the program finishes, click "generate footprints" and export them to a txt file.
You'll have a few footprints which aren't in SER by default. Take those footprints, import them into Gscraper, import your keywords, and you're good to go.
An avalanche of potential targets. To filter them after the scraping is done, just apply the filters you would normally do, export the list, import it into SER, and let me know how it goes.
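Since the whole point is footprints SER doesn't already ship with, you could also diff the exported FF footprints against SER's exported footprints before scraping. A small sketch, with placeholder file names:

    # Sketch: keep only the Footprint Factory footprints that are not already in SER's export.
    def load(path):
        with open(path, encoding="utf-8", errors="ignore") as f:
            return {line.strip() for line in f if line.strip()}

    ser_defaults = load("ser_footprints.txt")   # footprints exported from SER (placeholder name)
    ff_generated = load("ff_footprints.txt")    # footprints exported from Footprint Factory

    with open("new_footprints.txt", "w", encoding="utf-8") as f:
        f.write("\n".join(sorted(ff_generated - ser_defaults)))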
So, I bring URLs from my verified folder, say Joomla blogs (I currently have only 6 URLs).
Here are the footprints from SER for Joomla blogs:
"Fields marked with an asterisk are required" joomla
"Please login to write comment" "add new post"
"powered by joomla" "add new post"
"Smart Blog" "Add new post"
So, the FPF would go out and expand that footprint list manyfold, so that I could then upload it into Gscraper and go hunting for way more Joomla blogs than I would find with those footprints from SER alone?
Is this correct?
And how do the keywords come into play here? Say I want to find sites about dogs, and I have 10k keywords about dogs. Will the FPF / Gscraper then randomly combine those keywords with the newly found footprints to find even more, and even more specific, sites?
Thanks for the insights, @seo4all!
The keywords only come into play when you're using Gscraper.
If you want to find related URLs, you'll put your keywords in quotes in Gscraper. I personally don't use GSA to link directly to money sites, so I don't usually scrape niche-related URLs.
I go with general terms because for me what matters is numbers, not relevancy. It depends on what you're trying to rank, but if you want relevant sites, putting the keywords in quotes in Gscraper is the way to go.
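To make the mechanics concrete: Gscraper essentially crosses every footprint with every (quoted) keyword into one search query per pair, which is where the huge query volume comes from. A toy sketch with made-up example values:

    # Sketch: combine each footprint with each quoted keyword into one search query per pair.
    footprints = ['"Powered by WordPress + Article Directory plugin"',
                  '"powered by joomla" "add new post"']
    keywords = ["dog training", "dog grooming"]   # example niche keywords

    queries = ['%s "%s"' % (fp, kw) for fp in footprints for kw in keywords]
    for q in queries:
        print(q)
    # A few hundred footprints x 10k keywords gives millions of queries.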
One more thing... Do you know what "delete if index < ____" means in Gscraper? I thought it meant there are fewer than a million websites for that particular footprint, but now I'm not so sure anymore...
And let's say I come up with 50k Joomla blog URLs, as an example.
I then import that into SER. @Sven, could you help me real quick here... How does SER know that a particular URL is part of an article engine, Joomla in particular? Does SER just go after the URL, and if it can post to it, say "all right, this turned out to be a Joomla URL, so let's put that into the identified / verified folder"?
Or how does it work?
@Artsi, "delete if index < ..." in Gscraper is a function to delete the URLs that are below the value you enter there. Most likely you'll put a "1" there, then run an index check.
If the index value shows "0" after the check, it means the URL is not indexed in Google, so in this case "delete if index < 1" would delete all the URLs which are not indexed.
As far as importing into SER goes, I wouldn't be worried about it. You'll import the txt file, and before posting SER will automatically identify the platform.
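Put differently, the filter just drops anything whose index count is below the number you typed in. If you had the index-check results exported as tab-separated "url, count" lines (an assumed format, not necessarily Gscraper's own), the same filter in Python would be:

    # Sketch of "delete if index < 1": keep only URLs whose index-check count is 1 or more.
    # Assumes a tab-separated file with the URL in column 1 and the index count in column 2.
    keep = []
    with open("index_check_results.txt") as f:     # placeholder file name
        for line in f:
            parts = line.strip().split("\t")
            if len(parts) >= 2 and parts[1].isdigit() and int(parts[1]) >= 1:
                keep.append(parts[0])

    with open("indexed_urls.txt", "w") as f:
        f.write("\n".join(keep))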
Solved for me... version 8.27 is way better.
@killerm, you're running OK at 8.27? How many threads / projects, and are you ONLY driving imported, external lists (not site lists)?
@Artsi, I think you're also doing the index check in vain (assuming you do the PR check). If the URL has at least a PR of 1, it means that 99% of the time it will be indexed in Google (there are exceptions to this, but very few, at least in my experience).
Instead of running an index check, I would recommend doing a PR check (on a domain level, not URL), deleting the URLs which have a PR of less than 1, and you're good to go.
In the time it takes to run an index check on a list, you could scrape more targets.
Please note that I'm not by any means a "scraping master". What I've told you here is, however, what works best for me. I used to do the index check as well a while back, and I didn't notice any productivity improvements or better results. What I did notice was that filtering a list was eating more time than it should.
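The domain-level PR filter is the same idea in code. In the sketch below, get_pagerank() is only a stand-in for whatever PR checker or API you actually use, not a real library call:

    # Sketch: drop every URL whose root domain has a PR below 1, with one lookup per domain.
    from urllib.parse import urlparse

    def get_pagerank(domain):
        # placeholder: plug in your own toolbar-PR checker or API here
        raise NotImplementedError

    pr_cache = {}
    kept = []
    with open("filtered_urls.txt") as f:           # placeholder input file
        for line in f:
            url = line.strip()
            if not url:
                continue
            domain = urlparse(url).netloc.lower()
            if domain not in pr_cache:
                pr_cache[domain] = get_pagerank(domain)   # cache so each domain is checked once
            if pr_cache[domain] >= 1:
                kept.append(url)

    with open("pr_filtered_urls.txt", "w") as f:
        f.write("\n".join(kept))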
One thing I'm wondering is this: doesn't the FpF use proxies at all?
Another thing is this... One of the footprints I was given is this:
leave a comment
Isn't that going to be on a gazillion other websites as well? I'm just wondering how useful it is to be scraping with such general footprints.