Scraping my own set of targets

velsytetra · July 2014

I have been watching a Matthew Woodward video in which he uses a tool called Scrapebox to, I think, find his own list of target sites based on a platform list for Articles, Wikis, Bookmarks, and Videos. Is there a place in SER where a list like this can be inserted? Is doing this a benefit, or is it just as good to let SER find the sites based on the keywords I gave it? Are people out there using Scrapebox?
Thanks,

loopline · July 2014

I have used Scrapebox almost 365 days a year for the past 5 years, still use it daily, and plan to use it for as long as it's around. If you search around here you will find that many people recommend using Scrapebox with GSA SER.

Scrapebox is more of a Swiss Army knife of internet marketing than just a scraper, though. I use multiple elements of Scrapebox in conjunction with GSA SER.

jpvr90 · July 2014

If you're looking to scrape targets, go with Gscraper.

loopline · July 2014

Why Gscraper?

IMHO Scrapebox is exponentially better in every way. Plus it can do so much more than just scrape when it comes to ways to use it with GSA, and Sbox support is amazing; I hear Gscraper's is terrible. I just find no reason to use Gscraper at all.

jpvr90 · July 2014

@loopline See my reply... For purely scraping, Gscraper hands down beats Scrapebox.

I know you're a hardcore SB fanboy, so try not to get sensitive when users recommend other tools. Have you even used Gscraper? Any user that has used both knows that for pure scraping Gscraper is king.

Gscraper wins on scraping speed, the ability to import proxies at timed intervals, handling an unlimited number of keywords/footprints, deleting dups on the fly, adjusting target file size, going above 500 threads, etc.

Scrapebox can't import proxies in the multi-threaded harvester, only in the custom harvester.
The custom harvester has no ability to remove keywords/footprints already processed.
Honestly, the custom harvester sucks; I don't like it.
Scrapebox can't handle very large keyword/footprint lists; it will crash.

Don't get me wrong, I use SB on a daily basis for other things and own several licenses. But for purely scraping, SB has been dethroned by Gscraper.

squawk1200 · July 2014

Can you point me to a Youtube video (I know you have a ton of them there) on how to use SB to create your own list of targeted URLs? Can SB find targets that are not just for blog comments but other platforms as well?

Brandon · July 2014

@loopline @jpvr90 I love Scrapebox and use it daily as well. For scraping, however, nothing compares to Gscraper. Scrapebox always wants to crash over 1 million URLs; I harvest about 100k per minute with Gscraper.

loopline · July 2014

@jpvr90
I'm just trying to understand, because everyone I have encountered who thought so was using the tools in a way that specifically made it so.

By that I mean I don't import proxies. I don't need to: I just use private/shared proxies wisely and they never get banned, so there is no need to hassle with public proxies, importing, testing, messing around, etc. It just works, and it always works with no hassle. I don't understand why people persist in using public proxies anyway.

To that end, when using private proxies my keywords finish. I don't have to worry about keywords not finishing, because with solid, always-on, non-blocked private proxies they just finish and I don't need to export uncompleted ones.

I can import 800K to 1 million keywords into Scrapebox, from which I can scrape tens of millions of urls. I can scrape urls faster than I can process them.

I also don't need to dedupe on the fly. If I dedupe at the end or on the fly, it makes no difference and probably slows the process down to dedupe on the fly. This is not an advantage, but rather a personal preference, which you are fully entitled to as am I.

I don't need to go above 500 threads. I prefer staying at lower thread counts and not getting my private proxies banned, and finishing a list of keywords without having to start them over saves tremendous time.

How is adjusting target size an advantage? I seriously don't understand so just asking.

Custom harvester doesn't need the ability to remove keywords already processed: you can set up to 99 retries, and if your list finishes the first time around, there is no hassle exporting non-completed keywords.

Honestly custom harvester is extremely powerful if you know how to use it to your advantage.

But if you like, SB 2.0 will be 64-bit and can handle as large a keyword file as you care to load in. It's not done yet, but it's over 740K lines of code, last I heard. But again, I don't need to harvest more than 800K keywords in a go; when they all finish, that's a possible 800 million results.

Which is why I say that Gscraper is no "better" than Scrapebox. If you use both programs in a way that causes one to perform better than the other because it is tailored to your usage style, it could seem that way. However, used a different way, Scrapebox can be much faster, so it's subjective, is all I was getting at.

I just focused on how I could work within SB's parameters to make it work the best, not on how I want to use it. Thus I can make it dance faster than I can keep up.

Use it how you like of course, that's the point, but just because your use causes something to appear a certain way doesn't mean that it actually is that way.

I use GSA SER differently than many people here on the forum. I don't give a rat's rear end about LPM; it's not about speed, it's about quality. But my sites are ranking, and that's the end game for me.

@squawk1200

By create your own list of targeted urls, what do you mean? I mean are you talking auto approve, or that meet certain criteria or what? If you give me specifics I can give you a more accurate answer.

Scrapebox can scrape any url that Google and other engines can return, so you could scrape any platform. You can take the footprints that GSA SER provides for all its platforms and then scrape those. Wikis, for example: SB can't post to them, but you can scrape them and then load them into SER. SER gives you all the footprints it uses.

@Brandon
If you harvest 100K per minute that is a little over 1 trillion urls per week. What do you do with them? I have some serious servers, but I can't handle a trillion urls a week.

SB routinely lets me harvest over 100 million urls with no issues. Seriously though, how do you deal with a trillion urls a week?
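For what it's worth, the weekly figure is easy to check: a sustained 100K urls per minute works out to roughly 1 billion raw urls per week, not 1 trillion. A minimal Python sketch of the arithmetic, assuming the scraper runs around the clock at the quoted rate:

```python
# Rough throughput check; the 100K-per-minute rate is the figure quoted in the thread.
URLS_PER_MINUTE = 100_000
MINUTES_PER_WEEK = 60 * 24 * 7  # 10,080 minutes in a week

urls_per_week = URLS_PER_MINUTE * MINUTES_PER_WEEK
print(f"{urls_per_week:,} urls per week")  # prints "1,008,000,000 urls per week"
```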

Brandon · July 2014

@loopline I dedupe and every URL gets run twice through my software looking for targets. I run them twice as there are a lot of variables that could affect whether or not a link works at any particular time. I don't run a trillion as that's with all the dupes, but it's easily in the high millions, low billions I would estimate.

loopline · July 2014

Well low billions is impressive, I have to say.

umerjutt00 · July 2014

I would say go with Gscraper. It's just amazing when it comes to scraping.

satyr85 · July 2014

loopline

I tested Scrapebox and Gscraper many times with my own public proxies system (from 5k to 10-15k Google-passed proxies at any time).

Gscraper was able to harvest at 150-200k links per minute. Scrapebox was not able to go above 25k links per minute. SB was my second SEO tool and I like it, but when it comes to speed and stability Gscraper is superior.

Here are Gscraper and Scrapebox running at the same time. Same proxies, same keywords, same number of threads, timeout, etc., and Gscraper is 10 times faster.

jpvr90 · July 2014

@loopline So you admit you are a conservative scraper. Unless you have +x000 private/shared proxies, I see no way you can scrape a high volume of urls (in a short amount of time) with the current state of SB. And when I say high volume I mean +x00,000,000.

Importing proxies at timed intervals is extremely important when scraping high volume urls. Especially when you get into advanced operators like "inurl:, intitle:, etc."

Trust me when I say SB cannot handle very large keyword lists... I tried many times and got sick of it crashing all the time.

For hardcore scrapers Gscraper is best choice.

Olve1954 · July 2014

Hi @loopline, welcome to the GSA forum...

I too am using Scrapebox to scrape, and I'm happy with the results: 6-8 million raw urls per day on my old dual core connected to my home broadband. But I do use multiple instances of SBox, each scraping different sets of keywords for different platforms. I then import, identify and sort in SER. Having grouped the urls in their respective platforms (article, wiki, social network, etc.), I can select which platforms I want to identify in SER, and this greatly speeds up the identification process.

@loopline, is there any advantage selecting different locations? e.g. google.com, google.as, google.co.in etc.

Is there any advantage selecting different time span? e.g. Anytime, Past 24 hours, Past year etc.

When you say "use private/shared proxies wisely and they never get banned", does it mean you're not using advanced operators like "inurl:, intitle:, etc."? My proxies do last much longer when I don't use these operators. And my question is, what's the disadvantage of not using these operators? Am I losing out on lots of potential sites?

@jpvr90, you may be right that "For hardcore scrapers Gscraper is best choice.", but you'll need some serious hardware/servers to process those millions of urls, and the identification process is very CPU intensive.

@velsytetra, if you're the average guy (like me), not "hardcore", then Scrapebox is good enough for you. Plus it's a must have utility for any webmaster.

You asked,

>Is there a place in SER where a list like this can be inserted?

I use, Options->Advanced->Tools->Import URLs (identify platform and sort in)

>Is doing this a benefit, or is it just as good to let SER find the sites based on the keywords I gave it?

Scrapebox or Gscraper scrapes much much faster than SER. Plus you can and you should remove duplicate domains before you import into SER. Depending on your keywords, duplicate domains can be as high as 95%. Imagine letting SER scrape and not removing duplicate sites, it'll be a waste of time and cpu resources.
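Both scrapers have their own duplicate-removal features, but the idea of keeping one url per domain can be sketched in a few lines of Python. This is a minimal sketch, not what either tool does internally; the `dedupe_by_domain` helper is hypothetical, and treating `www.example.com` and `example.com` as the same domain is an assumption.

```python
from urllib.parse import urlsplit

def dedupe_by_domain(urls):
    """Keep the first url seen for each host; skip lines that do not parse as urls."""
    seen = set()
    kept = []
    for raw in urls:
        url = raw.strip()
        try:
            host = urlsplit(url).hostname  # None when there is no scheme/host
        except ValueError:
            host = None  # malformed url, e.g. a broken IPv6 literal
        if not host:
            continue
        if host.startswith("www."):
            host = host[4:]  # assumption: www and bare host are the same site
        if host not in seen:
            seen.add(host)
            kept.append(url)
    return kept

sample = [
    "http://www.example.com/a",
    "https://example.com/b",
    "http://blog.example.org/x",
    "not a url",
]
print(dedupe_by_domain(sample))
```

Subdomains are kept as separate hosts here; collapsing them to a registered domain would need a public-suffix list, which this sketch leaves out.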

Of course, if you're "lazy" and want to "set and forget", then let SER do it. But, then you'll have to pay more for a faster VPS and internet connection.

>Are people out there using Scrapebox?

Yes... only me and loopline...

velsytetra · July 2014

RE:"By create your own list of targeted urls, what do you mean? "
I thought that getting lists of urls to target, as sites that my content could be posted to, was the same thing as scraping. Is that not so? If not, then exactly what does scraping do?

squawk1200 · July 2014

@loopline, I am trying to figure out how I can scrape my own set of URLs, multiple platforms, similar or even identical to the scraped URL lists that can be purchased online at $29 or $39. I know what they are doing to create those lists cannot be rocket science, and between Scrapebox and GSA I have the tools to be able to do that; I just do not know how to make it happen.

cherub · July 2014

Have you tried Donald Beck's tutorial squawk?

gooner · July 2014

^^ That's a great starting point, you'll find some good info in there.

Aside from that, choose an engine, grab the footprints that SER uses by default and scrape them individually to see which yield best results - Focus your efforts on those. You also have to factor the time taken for some types of scrapes.
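Comparing footprints by yield can be sketched in Python. This is a minimal sketch, assuming each footprint was scraped separately and its harvested urls kept in their own list; the `rank_footprints` helper and the sample footprints and urls are hypothetical, and counting unique hosts is only one possible measure of "best results".

```python
from urllib.parse import urlsplit

def rank_footprints(results):
    """results maps footprint -> list of harvested urls.
    Returns (footprint, unique_host_count) pairs, best first."""
    ranked = []
    for footprint, urls in results.items():
        hosts = set()
        for url in urls:
            try:
                host = urlsplit(url.strip()).hostname
            except ValueError:
                host = None  # skip malformed urls
            if host:
                hosts.add(host)
        ranked.append((footprint, len(hosts)))
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked

# Illustrative data only; not actual SER footprints or scrape results.
harvest = {
    '"Powered by MediaWiki"': ["http://a.example/wiki/1", "http://a.example/wiki/2"],
    '"Leave a comment"': ["http://b.example/post", "http://c.example/post", "junk"],
}
for footprint, count in rank_footprints(harvest):
    print(count, footprint)
```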

velsytetra · July 2014

gooner,
What do you mean by "footprints" that SER uses? Where can I find these?

velsytetra · July 2014

cherub,
I looked on youtube for both "squawk" and "Donald Beck" and I came up empty on anything related to the subjects here. Where can I find the video you mentioned?
Thanks,

loopline · July 2014

@jpvr90
I do admit I am a targeted scraper. I can say I am quite adept at footprint building and I don't need to scrape billions of urls. I mean, I guess I could, but I don't have need of them. I can scrape 10 million urls, or twice as many, in a run and get all the core value I am after, if not with much less.

I used to have a half dozen dedicated servers or thereabouts, running tons of instances of Sbox scraping hundreds of millions of urls. But then I just trimmed it all down and dialed it all in. I really don't think I could deal with a billion urls; I mean, I don't think I would need to acquire that many for what I do and still be able to maintain accuracy.

I guess I should restate: I used to use scraping for raw urls, but nowadays I have become adept at "tracing" other people's work. I just find people who are doing great things and use scraping to find and examine what they do and dial things in. I don't need to process raw billions of urls; I just go see what everyone else accomplished with their hard work and "borrow" the process, combine it with what I already know, bounce it off people in the know, and dial in.

I guess I should say that while I have scraped, in the past, billions of urls in total with Scrapebox, I am an R&D person. It's how I work: figuring out how stuff works. I can't do what I do with Gscraper, so Scrapebox is superior in my opinion.

I also have never been a "how fast can I get it done" person, so I honestly don't see the value in being "as fast as possible". I have always been able to go as fast as I need with Scrapebox. I have a billion things on my plate, and I scrape and come back and have millions of urls as fast as I can handle them, while I'm busy trying to figure out how to dial it in better and get fewer urls with more results.

Anyway, I can't say that Scrapebox is faster than Gscraper; the image above proves otherwise. What I can say is that if you look at the overall time spent dealing with public proxies, recovering uncompleted keywords and such, compared to just loading up SB in a manner where it "appears" slower up front but saves a ton of time in the end, SB can often be faster.

I am not really trying to debate how fast it is side by side with the same setup; I was trying to debate the setup itself. I guess I just always try to ask, "how can I do it differently to make it better", and I have found that better is whatever takes less of my time to manage, and using SB to go slower but more accurately is better than using Gscraper to go faster.

Possibly I'm not making sense, but "faster" is relative to me. So that was my point.

Anyway, which is better aside, I will always recommend SB over Gscraper because it allows me to do things Gscraper cannot, and I really can only teach people what I know and what I believe is best. Each person is entitled to their own opinion; I have mine, you have yours, and both are relative/subjective.

I think in the internet marketing game there is more than 1 right way. Gscraper and Scrapebox can both be "right" depending on method/approach/end game, etc. But I like SB better and think it is better, and I always will.

I'm a hardcore SB fanboy, sure, but I know my stuff too and I can make it dance and teach others, so that's what I do.

@Olve1954
There are advantages to using other engines and timespans, but it depends on what you're after. What specifically are you trying to do?

What I mean by using proxies wisely is just don't go too fast. It depends on how many proxies you have and what you're trying to scrape, but I either set connections low enough or add a delay.

I don't much use advanced operators. In most cases, if you dig in, you can build good footprints without operators. In some cases operators are either much more useful or there really is no reasonable alternative because of what you're trying to scrape, so I go a bit lower on connections when using operators. Really though, in many cases, if not most, you can build a footprint that doesn't require the use of an operator.

@velsytetra
That's a loaded question. Scraping is a broad term that more or less means gathering data via automated means. You can scrape Yahoo Answers, scrape sites for meta data, emails, etc., scrape articles and content, or scrape urls from search engines.

So if you want to scrape url targets for platforms where you can post content with links in the content, identify which platforms those are in SER, then get those footprints and scrape them against your keyword list.

Go to options >> advanced >> tools >> search online for urls.
Choose add predefined footprint, select the engines and the footprints GSA has built in show up in the footprints box. Copy them to your scraper, merge them with your keywords and away you go.

There are other ways, but that's probably the easiest to start with.
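Scrapers generally do the merge step for you, but the idea is simply every footprint paired with every keyword. A minimal Python sketch, assuming plain lists of footprints and keywords; the `merge_footprints` helper and the sample values are hypothetical and only illustrate the shape of the resulting queries.

```python
from itertools import product

def merge_footprints(footprints, keywords):
    """Return one search query per (footprint, keyword) pair, skipping blank entries."""
    queries = []
    for footprint, keyword in product(footprints, keywords):
        footprint, keyword = footprint.strip(), keyword.strip()
        if footprint and keyword:
            queries.append(f"{footprint} {keyword}")
    return queries

# Illustrative values only; real footprints would be copied out of SER.
footprints = ['"Powered by MediaWiki"', ""]
keywords = ["dog training", "cat food"]
for query in merge_footprints(footprints, keywords):
    print(query)
```

The number of queries is footprints times keywords, so a large keyword list grows quickly once it is merged.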

@squawk1200
It's not at all rocket science; I have made such lists and sold them for years. I have a video on it in fact:

I am going to make a new updated video when Sbox 2.0 comes out but for now that will get you started.

Olve1954 · July 2014

Hi @loopline,

>There are advantages to using other engines and timespans, but it depends on what your after. What specifically are you trying to do?

I've about 2 mil keywords, which I randomize and use over and over again every day. Does selecting different Google engines (countries) and different timespans return different results? And hence, by rotating them, am I able to scrape more from Google?

squawk1200 · July 2014

@loopline, thanks, will check out the video, thanks for the reference.

@cherub, I don't know who Donald Beck is.

Molex · July 2014

Where can I get Gscraper? Which is the legit website?

cherub · July 2014

@squawk1200 and @velsytetra this is the fella I'm referring to: https://forum.gsa-online.de/discussion/7406/ultimate-gsa-ser-list-building-video-guide-video-case-study/p1

@Molex http://gscraper.com/ is their legit website

Molex · July 2014

okay thanks!

It seems there's no support forum on their site...

cherub · July 2014

Yes, their support seems horrendous, though if you speak Mandarin or Cantonese you may get a better response.

Kaine · July 2014

I have contacted them 3-4 times, once to get a refund, and they were pretty cool (sales@gscraper.com).

One of their employees even took the time to check a command via mstsc.

velsytetra · July 2014

@squawk1200 and @cherub, thank you for the video information.

loopline · July 2014

@Olve1954

Yes, it would return different results. If you use them every day, perhaps choosing the 24h time span would be best; that way you are always only getting what's new, assuming you do it at about the same time each day. Otherwise you could choose something like 1 week and always get the newest stuff. More engines mean different results, although with a lot of bleedover. Also, you can use the custom harvester in Scrapebox and work with over 20 engines, which will give different results.

@squawk1200
You're welcome.

Olve1954 · July 2014

Thanks @loopline. Guess I'll try selecting a weekly (currently Anytime) timespan and rotate engines once per week...
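A weekly rotation like this could be planned with a few lines of Python. This is a minimal sketch under stated assumptions: the engine names are the examples mentioned earlier in the thread, the `engine_for` helper is hypothetical, and it only decides which engine to pick in a given week; it does not control Scrapebox in any way.

```python
from datetime import date

# Example engines mentioned earlier in the thread; extend the list as needed.
ENGINES = ["google.com", "google.as", "google.co.in"]

def engine_for(day, engines=ENGINES):
    """Pick an engine based on the ISO week number, so it changes once per week."""
    week = day.isocalendar()[1]  # ISO weeks run Monday to Sunday
    return engines[week % len(engines)]

print(engine_for(date.today()))
```

Because the choice depends only on the week number, running the script on any day of the same week gives the same engine.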
