Some really basic questions about SER
So, I've been using SER for some time, but I'm still lacking knowledge of some very basic things. I thought I'd ask here for some clarification.
1) What is a thread, really?
Let's say you have 1 project for the sake of simplicity... If you're running 500 threads, does it mean that SER will either be trying to submit to 500 URLs from site lists, or scanning search engines for 500 keywords, or what?
2) What's the difference between thread count / timeout in global options versus proxy options?
I don't understand what those two mean. I think the global options timeout means how long SER will wait (in one thread?) for a website to send the first bit of data, right? Well, what does the proxy timeout mean then? And how about the proxy thread count?
3) What causes "download failed" logs?
I'm seeing these quite a bit. Does it mean one of the following:
1) the website is down
2) it takes too long to load
3) it hit the proxy / global options timeout limit?
4) What's a proper amount of threads to run?
Now, I know this is dependent on the VPS and a million other settings. I'm just curious as to how this relates to the number of proxies. Like, if you're not using SER for scraping at all, can you radically raise the thread count?
5) Using imported lists
Sorry, I already posted a question about this, but let's have another run... I'm mainly curious about the engine selections when importing lists...
I mean, you're not scraping for targets when importing lists. Wouldn't it then make sense to check ALL engines and try to post to as many of them as possible? I've understood that if you don't have articles checked - as an example here - and you bring in a list of 100k article URLs, SER won't post to any of them, right?
Conversely, let's say you've been unable to post to a certain article engine when scraping for targets. Well, wouldn't it make sense to still tick it when you're importing lists, in case it succeeds this time around?
I know these are really n00b questions, but I still don't understand them nevertheless.
Would be awesome to hear input from guys like @Sven, @ron and others!
Comments
2) You're right about the HTML timeout, and the proxy timeout is how long SER waits before deciding that a proxy is dead. The proxy thread count is how many threads to use when you test your proxies.
3) Yep, all of them. You get this whenever the site can't be accessed, for whatever reason.
4) Depends on your setup, but it should never be more than 10 x proxies, e.g. 50 proxies = 500 threads (see the rough sketch below). You can run it higher if you don't use SER to scrape, and lower if you are scraping and don't want Google to ban your proxies.
5.1) Yes, you can tick everything, and if an engine is unticked SER won't post to it.
5.2) You could do that, but you might be wasting SER's resources that could be used on other platforms where you'll definitely get a link.
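To make that 10 x proxies rule of thumb concrete, here's a rough Python sketch - the numbers and the scraping penalty are only illustrative assumptions, not anything SER itself enforces:

```python
def suggested_thread_cap(proxy_count, scraping_with_ser):
    """Rule of thumb from this thread: at most 10 threads per proxy.
    If SER is also scraping search engines, stay well below that so
    Google doesn't temp-ban the proxies (the 0.5 factor is an assumption)."""
    cap = proxy_count * 10
    if scraping_with_ser:
        cap = int(cap * 0.5)  # illustrative safety margin, tune to taste
    return cap

print(suggested_thread_cap(50, scraping_with_ser=False))  # 500
print(suggested_thread_cap(50, scraping_with_ser=True))   # 250
```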
I think I just experienced some sort of Satori! Thanks a lot!
Just to clarify... So, the proxy threads... Let's say I have 30 proxies. A thread count of 10 would then mean that SER can test 10 at a time, leaving 20 for "real" use? And if it was 30, SER could theoretically halt everything else while it tests the proxies?
Continuing on the theme of importing lists...
Wouldn't keeping all the same engine selections ultimately lead to a verified list full of duplicates? I mean, having at least one project trying to post to EVERYTHING - wouldn't that be wiser for building up a verified list?
And here's one more...
6) Using verified list
I'm also kind of curious about when I should drive the verified list into a project. I mean... If I have a list of my own and it's not shared with anyone, how often do you think it's okay to use it? I'm mainly thinking about leaving a footprint if each and every project has like 95% the same links...
In case someone's curious as well on the what engines to check...
I just made a test on one of my SER's.
One half I ran with ALL engines checked. On the other half, I chose only the engines SER has successfully posted to and verified.
Results?
ALL engines: around 5 LpM
Verified ones: 115 LpM.
Guys, know your stats.
I wouldn't worry about footprints too much with SER - providing you've got a big enough list.
Even if you hit two projects with the exact same list, some sites would give you a link to one project and not to the other, different links will stick and different ones will die.
Recently I've been using a variant of what the SERList chaps suggest you should do, but until I started doing that I was using identified and verified on all projects.
I think it would be acceptable to at least use a portion of verified links and then the rest from imported lists or something...
Lately I've been focusing exclusively on using SER in my SEO efforts, so the learning curve is quite steep.
Thanks again for great answers!
One more question!
When I'm importing URLs... Let's say I have a list of 100k and 10 projects.
Now, if I select all those projects and import that way, is SER going to import the whole 100k into every selected project, or spread the 100k evenly across all the projects?
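For what it's worth, if it turns out SER imports the full list into every selected project and you'd rather split it yourself beforehand, a quick sketch like this does the job (the file names are just placeholders):

```python
# Split one big URL list into N roughly equal files, one per project.
# "scraped_100k.txt" and the output names are placeholders.
def split_list(path, projects):
    with open(path, encoding="utf-8", errors="ignore") as f:
        urls = [line.strip() for line in f if line.strip()]
    for i in range(projects):
        chunk = urls[i::projects]  # round-robin so every file gets a fair share
        with open(f"project_{i + 1}.txt", "w", encoding="utf-8") as out:
            out.write("\n".join(chunk))

split_list("scraped_100k.txt", projects=10)
```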
Is there a smart way to sort scraped URLs into the engines that SER recognizes and can post to?
Wouldn't that be a more effective way to import lists - to have them sit in a folder, for example, and then set SER to read from that folder?
With 30 projects that would cut down the time it takes for every project to figure out what engine a URL represents and whether it can even try to post there...
I think @ron mentioned somewhere having a project post to some trash URL and then write into that folder? How do I make it so that nothing else is written into that folder?
Something like the folder option for auto-processing lists is in the works according to @sven, but it will still need to check the sites against its footprints to decide what platform each site belongs to.
Unless @ron knows a better way, you'd have to create a new folder, set it as verified and then run those projects exclusively until the imported list has been processed. Do that a couple of times to make sure SER has found everything it can.
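Just to illustrate the idea of "checking a site against footprints" - this is a toy Python sketch, not how SER actually stores or matches its footprints, and the footprint strings and platform names are made up:

```python
import urllib.request

# Made-up footprints for made-up platforms, purely for illustration.
FOOTPRINTS = {
    "article_script_x": ["powered by article script"],
    "wiki_engine_y": ["recentchanges", "wiki software"],
}

def identify(url, timeout=20):
    try:
        html = urllib.request.urlopen(url, timeout=timeout).read().decode("utf-8", "ignore")
    except Exception:
        return None  # couldn't download - the same situation SER logs as "download failed"
    page = html.lower()
    for platform, needles in FOOTPRINTS.items():
        if any(needle in page for needle in needles):
            return platform
    return None  # no footprint matched, so no engine to post with

print(identify("http://example.com"))
```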
What do you think @davbel, would it be better to put all the successful ones into that folder? Or verified?
I have a couple more questions.
1) I think you said somewhere you have a SER install of its own doing nothing but sorting the lists, is that correct?
2) Why do you run these "trash projects" through them? Wouldn't it make more sense to put actual projects in there and grab the links for those as you go?
3) How would you run a scenario like this:
You have scraped and cleaned 5 lists:
- 36k
- 36k
- 56k
- 56k
Would you make a separate project for each of these, or put them all in the same project?
4) What kind of settings do you use?
I mean... Do you put the thread count a lot higher, for example? Any other major changes to the settings?
This is so awesome. I need to forget about just dumping the lists into projects as they are, as every project then needs to go through the same trouble of finding what works and what doesn't etc.
Thank you for your answers!
Or are you talking about how @ron uses lists on his SER installations?
The processes used for each are different.
So... Here's where my understanding is right now:
- scrape and clean lists
- create a couple of projects in SER, and import the lists
- have SER write to a new verified folder (I just create one on my desktop)
- after SER is done, I figured I'd make that folder "identified", as an example, and then import this "identified" site list into projects
Am I on the right track at all here?
You are on the right track for sure. We do the same thing, just on a larger scale.
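One small thing that helps at this step: that new "verified" folder ends up holding plain-text sitelist files, and they collect duplicates fast, so it's worth de-duplicating before re-importing anywhere. A rough sketch - the folder path is only an example, and it assumes one URL per line per file:

```python
from pathlib import Path

# Example path only - point this at whatever folder you told SER to write to.
folder = Path(r"C:\Users\me\Desktop\new_verified")

for f in folder.glob("*.txt"):
    urls = f.read_text(encoding="utf-8", errors="ignore").splitlines()
    unique = sorted({u.strip() for u in urls if u.strip()})
    f.write_text("\n".join(unique), encoding="utf-8")
    print(f.name, len(urls), "->", len(unique))
```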
Okay, I'll just put 4 projects into both of my SER's right now, and drive a couple of hundred thousand links through them. Let's see how it goes.
Oh hey, by the way... What kind of verified percentages do you guys typically get when running scraped lists through SER? 10%?
I have around half a million URLs going through it right now. Super excited to see how it ends up!
And here's another question... When do you know SER has done everything it can with those lists? Do you wait for the "no more targets to post to" message, or do you watch the remaining URLs or what?
For whatever reason, it seems like SER leaves a couple of hundred URLs just hanging in there, and there they sit, never giving the "no more targets to post to" message.
@JudderMan, yeah I think I have around 400k URLs right now - they're running on 2 SER's and divided between 8 projects. So, what's that - around 50k per project or so?
Hey do you guys put the identified and successful into their own folders as well? Could re-importing the identified and successful be any good in this case?
I must say that the 1% does sound a bit low to me... I was expecting something more to the tune of 10%... How about you @JudderMan? What kind of verified % are you seeing?
But we have worked really hard on the footprints to get to that %. It's a long long process.
So we are always tweaking the footprints to narrow down on scraping only the sites that SER can post to.
It's not possible to scrape only those URLs of course, but we try and improve the % all the time.
Personally, when I see a project has less than around 10,000 links, I throw another list in there, so it never gets too low.
I split the list into 100k chunks as @judderman suggested.
EDIT: Those figures are the submitted-to-verified %.
1% from scrape to verified is probably about right. Not totally sure on that figure off-hand.
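Just to keep the two percentages in this thread apart - scrape-to-verified and submitted-to-verified have different denominators. A quick sketch with made-up numbers:

```python
# Made-up example numbers, only to show the two different ratios.
scraped = 500_000   # raw URLs imported from scraping
submitted = 40_000  # targets SER actually submitted to
verified = 5_000    # links that ended up verified

print(f"scrape -> verified:    {verified / scraped:.1%}")    # 1.0%
print(f"submitted -> verified: {verified / submitted:.1%}")  # 12.5%
```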
Yeah, that footprint thing is next on my "things to learn" list. I'll just get going with the footprints SER is currently able to post to, and once I start getting less-than-stellar results, I'll look into the footprints a bit more.
As far as submitted to verified, again this depends on the platform, but 5-10% would be bad and 50-70% would be good.
As @gooner says, getting the footprint right is pretty much the key thing to scraping. This is where you will see the biggest improvements in the number of verified links.
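And since footprints keep coming up as the key to scraping, the basic idea is just combining each platform footprint with each keyword to build scraper queries. A minimal sketch, with example strings only (not a curated footprint list):

```python
import itertools

footprints = ['"Powered by Article Script"', 'inurl:wiki "RecentChanges"']
keywords = ["gardening", "home improvement", "fitness"]

# One search query per footprint/keyword pair, ready to feed to a scraper.
queries = [f"{fp} {kw}" for fp, kw in itertools.product(footprints, keywords)]
for q in queries:
    print(q)
```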