Verified Site List - Not Verified at All?
hey guys,
quick question for you. the last few weeks i've been scraping like a madman to build a solid verified site list my projects can use. the other day i thought, why not give it a test run and create a dummy project, just to see how high i can push the VPM. the results have actually been quite good, but i'm still not satisfied at all.
i've noticed something very strange - while only using links from my verified site lists, i got a lot (at least a lot for a verified list) of errors i usually only see when going through a freshly scraped url list (no form at all, no registration page, xxx expected, unknown platform, ...). this really puzzled me.
then i thought of one reason that may be responsible for this: maybe some of these sites are temporarily unavailable, or have changed their layout, engine, etc., and therefore aren't what they were before.
would you consider this a plausible reason too, or is there something wrong with my list/settings?
Comments
There's just one thing I didn't quite get. Do you keep your original verified folder, or do you delete it once you've gone through it and extracted the gold nuggets, as you call them? It sounded to me like you keep it, but I don't see the reason why.
Let's say 50% of your original folder has passed and moved on to your new 'gold nugget' verified folder. Your original verified folder would then consist of exactly those links (which would be skipped, unless you allow multiple posts on the same site), while the other 50% would be links of quality close to garbage, because many of them have become permanently unavailable and a small percentage may only be temporarily unavailable. All in all, pulling links from your nugget folder AND your former verified folder seems redundant to me.
But then again, maybe I totally misunderstood you there.
Nah, you throw it away. Your verified file is only as good as it is current. You processed the entire old file, extracted the gold, and now you throw away the old one - it has no value.
That is why you want to run that old verified file through several spam tiers. You want several swings at the plate for each link, and if 3 (or more) projects can't make a link work from that old verified file, then who are you kidding, right? Time to take out the trash.
You want to import it because SER will then try to post to each URL in order (as opposed to ticking a folder in projects:sitelist, which chooses targets randomly - and often repetitively). You simply want nothing in the target URL cache of these spam tiers before you import the old verified file. You want no site lists checked. You want no search engines. You want all engines checked. You want to only process that list and nothing else. And when it is done, bam, the icon shows up, and you are finished.
And yes, turn on "post to same domain more than once" or however it is worded. In fact, from these projects, you should also be deleting all target URL history. And to take it even one step further, use all new emails, as in 10 new ones for each project. You want the maximum efficiency to make every link possible with this type of processing. This is obviously a junk tier that makes no direct links to any of your moneysites.
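For anyone who wants to script the "extract the gold" step afterwards: a minimal sketch, assuming each spam tier's verified URLs have been exported to their own text file (all file names below are made up):

```python
# Hypothetical names for each spam tier's verified-URL export.
TIER_FILES = ["tier1_verified.txt", "tier2_verified.txt", "tier3_verified.txt"]

def load_urls(path):
    """Read one URL per line, skipping blank lines."""
    with open(path, encoding="utf-8", errors="ignore") as f:
        return {line.strip() for line in f if line.strip()}

# A URL survives if at least one spam tier managed to verify it;
# everything else from the old file is the trash being thrown out.
gold = set()
for path in TIER_FILES:
    gold |= load_urls(path)

with open("gold_nuggets.txt", "w", encoding="utf-8") as out:
    out.write("\n".join(sorted(gold)))

print(f"{len(gold)} gold nuggets kept; the old verified file can go.")
```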
It's always good to subscribe to verified lists.
Plus I really like SER's inbuilt scraper too.
Just two really quick questions:
"That is why you want to run that old verified file through several spam tiers. You want several swings at the plate for each link, and if 3 (or more) projects can't make a link work from that old verified file, then who are you kidding, right? Time to take out the trash."
That's what you wrote earlier. You didn't explicitly mention (in your step-by-step tutorial) to take several swings at one URL. Should I tick 'Continuously post to same URL even if failed before' for these spam tiers, or rather create 3 projects for the same chunk of links? You got me a little confused on that particular detail.
Is there a particular reason why you suggest importing the file chunks through one of the global site list folders (the failed one, in this case) instead of simply splitting the file into 10 chunks and manually importing them by right-clicking each of the 10 projects?
Again, ron, words can't even begin to describe how helpful you are. I appreciate this so much, I really do.
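For what it's worth, the manual split mentioned above is only a few lines of script; a sketch, with the file name and chunk count as placeholders:

```python
CHUNKS = 10  # one file per project; pick whatever matches your project count

with open("old_verified.txt", encoding="utf-8", errors="ignore") as f:
    urls = [line.strip() for line in f if line.strip()]

# Round-robin split so every chunk gets a similar mix of platforms.
for i in range(CHUNKS):
    with open(f"chunk_{i + 1}.txt", "w", encoding="utf-8") as out:
        out.write("\n".join(urls[i::CHUNKS]))
```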
I'm not using any filters whatsoever. All I did was untick a few engines I never use. I've deduped my 350k list on a regular basis. I don't even know what to say right now. I thought I might lose 10-30%, hell, maybe even 50% - but 90% (!!!).
I'll wait for the 5 new projects to finish and then evaluate the results again.
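Side note for anyone following along: deduping a list that size takes seconds with a small script. A sketch, assuming one URL per line (file names made up):

```python
# Order-preserving dedupe of a one-URL-per-line list.
seen, unique, total = set(), [], 0
with open("verified_350k.txt", encoding="utf-8", errors="ignore") as f:
    for line in f:
        total += 1
        url = line.strip()
        if url and url not in seen:
            seen.add(url)
            unique.append(url)

with open("verified_deduped.txt", "w", encoding="utf-8") as out:
    out.write("\n".join(unique))

print(f"{len(unique)} unique URLs kept out of {total} lines")
```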
I do exactly what @ron said on a weekly basis. I have two dedicated servers scraping 24/7, between 25,000 and 100,000 URLs per minute each. I run all of those scans on multiple servers to find the good stuff. I just finished my new master list and it has 12,000 unique domains (no blog comments, indexers, or exploits).
12,000 unique domains doesn't sound like much, but I get numbers similar to @ron's because it's a 100% clean list.
I mean, if even people like you who scrape like their lives depend on it can't sustain a big list with more than 200-300k links, then how am I supposed to pull that off?
My plan was to scrape enough to end up with a basic verified list of 200-300k verifieds to use for my ranking projects. But since I'm now left with ~30k URLs (of which ~50% or even more aren't even dofollow), I don't even know how I'd create a tiered structure, because I simply don't have enough links.
I care about one link per domain. I don't want to post 25 links on the same domain, although that would definitely make my numbers look better. I don't believe that certain types of links provide results, so I exclude them even though keeping them would inflate the numbers. My point is that results are more important than numbers.
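A minimal sketch of that one-link-per-domain filter, in case anyone wants to apply it to their own list (naive hostname matching, not anyone's actual tooling):

```python
from urllib.parse import urlparse

def one_per_domain(urls):
    """Keep only the first URL seen on each host."""
    seen, kept = set(), []
    for url in urls:
        host = (urlparse(url).hostname or "").lower()
        # Naive normalisation; a proper job would use a public-suffix list.
        if host.startswith("www."):
            host = host[4:]
        if host and host not in seen:
            seen.add(host)
            kept.append(url)
    return kept

urls = ["http://a.com/post/1", "http://www.a.com/post/2", "http://b.org/x"]
print(one_per_domain(urls))  # -> ['http://a.com/post/1', 'http://b.org/x']
```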
From what you guys have told me I get the strong feeling that I should rethink my scraping strategy. Right now I'm using no filters on my scraping footprints and my goal is to scrape as much as possible.
But now I'm thinking I should filter out all the unnecessary NoFollow platforms and focus on contextual dofollow links instead. I'll still get a lot of NoFollow links for diversity, but right now (without any filters at all) I get 25% df and 75% nf.
Right now I'm trying to figure out how to properly build links and what to focus on and how to do it exactly. Because my current strategy obviously doesn't work that well, since 90% of my scraped links die after a couple of weeks. Repetitive posting to the same domain/URL definitely makes things easier and more effective. And I need to be efficient, because there's nothing I hate more than being inefficient and wasting my time.
Now, when you talk about 'lower tiers' are you referring to your secondary links that point at your CDF tiers, or does this include your lower CDF tier(s)? Or does every single link in your CDF pyramid come from a unique domain? Just wondering, because that'd be hard to pull off, I guess.
Regarding the dofollow matter - I wasn't going to shoot 99% dofollow links at my money site, but I'd rather have as many as possible. Especially since even if you uncheck typical nofollow engines, you still end up with ~50% nofollow links.
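One rough way to measure that df/nf blend yourself is to fetch a sample of verified pages and check the rel attribute on your backlink. A sketch, assuming the requests and beautifulsoup4 packages are installed; "yourdomain.com" and the file name are placeholders:

```python
import requests
from bs4 import BeautifulSoup

def link_is_dofollow(page_url, target_domain):
    """True/False for the first link pointing at target_domain, None if it's gone."""
    html = requests.get(page_url, timeout=15).text
    soup = BeautifulSoup(html, "html.parser")
    for a in soup.find_all("a", href=True):
        if target_domain in a["href"]:
            return "nofollow" not in (a.get("rel") or [])
    return None  # link vanished - the "not what it was before" case from the OP

df = nf = gone = 0
with open("verified_sample.txt", encoding="utf-8") as f:
    for line in f:
        try:
            result = link_is_dofollow(line.strip(), "yourdomain.com")
        except requests.RequestException:
            continue  # site temporarily unavailable, skip it
        if result is True:
            df += 1
        elif result is False:
            nf += 1
        else:
            gone += 1

print(f"dofollow: {df}, nofollow: {nf}, link gone: {gone}")
```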
@Tixxpff - Yeah, unchecking the nofollow creates less links. But if you have things spinning at a high LPM, then I say go for it. You are right, no matter what you uncheck, you are still going to get a fairly balanced blend. So I like it.
With my secondary links, I was referring to T2. I was babbling that, if I could possibly do it, I would love to have unique domains all the way through the pyramid, but it is incredibly impractical. I think some people try to do that in practice - but I am sure they make a lot fewer links (as well as constantly running out of targets if done on any kind of scale).
Overall, I like how you worded it. Insanity is doing the same thing and expecting different results. Change it up dude!