@banditim $50 for 500 million links sounds fair, especially if those were deduped links.
Suggestion: how about integrating a web crawler that crawls the entire web? Each page the crawler visits would be matched against a footprint. You then build a search engine that searches the crawled pages for keywords and filters the results by engine.
@bencrabara - Crawling the 'entire web' is a little overkill, but we definitely do have some extra crawling outside of just users' scrapes to allow us to get even more footprints and websites for specific engines.
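In case it helps picture the footprint-matching idea being discussed here, this is a minimal sketch of matching a crawled page against engine footprints; the footprint strings, engine names, and sample HTML are purely illustrative, not BanditIM's actual implementation:

```python
# Toy illustration: match a crawled page against known engine footprints.
# Footprints and the sample page below are made up for the example.
FOOTPRINTS = {
    "wordpress": ["powered by wordpress", "wp-content"],
    "drupal": ["powered by drupal"],
}

def match_footprints(html):
    """Return the engines whose footprints appear in the page source."""
    lowered = html.lower()
    return [engine for engine, prints in FOOTPRINTS.items()
            if any(fp in lowered for fp in prints)]

page = "<html><body>Proudly powered by WordPress</body></html>"
print(match_footprints(page))  # ['wordpress']
```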
@Kaine - That's extremely odd -- I had typed out a very long response to you and was awaiting YOUR response haha. No need to get bitter - I'll retype it and send the PM again (not sure what happened to it).
How good will it be at importing massive footprint lists? By massive I mean around the 5M mark (something I am about to start on)? And how quick would it return the results?
Also, when charging for results, will you be charging for the number after or before the dedupe?
@Flembot - We haven't gotten to the point of testing 5 million, but we allow users to copy/paste or upload files with their respective footprints and/or keywords, so it shouldn't be an issue. Result return speed will vary depending on how many of your search terms we have in our cache. If most of your search terms are in our cache, you could get 100 million results back in a matter of minutes.

As for the charging of the results, to keep the comparisons even with Scrapebox and GScraper, we will charge before de-dupe. We obviously know there will be a lot of dupes in there, and will keep that in mind, but we want to make the perfect comparison with the other tools to prove we WILL be cheaper and more efficient than the others.
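To illustrate why cache hits make such a difference in turnaround time, here is a rough sketch of splitting a batch of queries into cached vs. uncached; the query strings and the in-memory store are invented, and this is not how the service's real cache is built:

```python
# Rough illustration: split queries into cache hits and cache misses.
# `cached_results` stands in for whatever store the service actually uses.
cached_results = {
    '"powered by wordpress" fishing': ["http://example.com/blog1"],
    '"powered by wordpress" golf': ["http://example.com/blog2"],
}

def split_queries(queries):
    hits = {q: cached_results[q] for q in queries if q in cached_results}
    misses = [q for q in queries if q not in cached_results]
    return hits, misses

queries = ['"powered by wordpress" fishing', '"powered by drupal" tennis']
hits, misses = split_queries(queries)
print(len(hits), "served from cache,", len(misses), "need a live scrape")
```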
@spammasta - Not yet. We've still got a couple of weeks to make sure the whole system is ready to go. I wanted to make this thread now to get our beta testers signed up. I will stop accepting new beta testers fairly soon.
@BanditIM - Interested to see the pricing then, because with GScraper you can use the free version to scrape all day long. With dedupe enabled it is really hard to reach the URL limit.
Will there be the ability to filter results? For example, I doubt anyone wants to get Google webcache results at all.
@Ferryman - With the free version you still have to provide your own proxies and cover server costs though. If you're talking about just using like 1 or 2 threads to scrape with your plain IP, I think that's a far stretch haha. We may offer something like 10,000 free scrapes for all users each month so it is comparable to that free scraping.
As for filtering results, we'll add in all those extra features as we progress. They are simple and easy to add in, but just take some time because there are so many of them to add. We want to get the core idea down before getting too far ahead of ourselves. I do have to ask though -- I've been scraping for years and haven't encountered my scraped results coming back with webcache results... does Google somehow show this in the searches? Mind giving an example?
The caching idea looks great, since Google is getting smarter about proxies every single day. Though it comes down to the refresh rate of the cached queries. If you are going to set it at something like 2 weeks to 1 month, they will be quite useless.
@BanditIM - Yes, of course you have to provide your own proxies. Still, for the $30 mentioned above you get enough reverse proxies to use all day long. For $100 I wouldn't even bother getting the service unless it is really phenomenal.
About the webcache - weird, I am getting them a lot on GScraper (every third result or so). Doesn't really bother me as I just dedupe by unique domains.
Would be nice if there was an option to get daily, weekly, monthly, etc. results so you could just get the fresh ones instead of scraping the same thing over and over again.
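Since webcache results and dedupe-by-domain came up in the last couple of posts, here is a quick sketch of how you could clean a scraped list yourself in the meantime; the example URLs are invented and the webcache host check is just one possible filter, not anything tied to GScraper's actual output:

```python
from urllib.parse import urlparse

# Drop Google cache copies and keep one URL per domain.
# Example URLs are made up; adjust the filter to whatever junk you see.
urls = [
    "http://webcache.googleusercontent.com/search?q=cache:example.com/page",
    "http://example.com/page",
    "http://example.com/other-page",
]

seen_domains = set()
clean = []
for url in urls:
    host = urlparse(url).netloc.lower()
    if "webcache.googleusercontent.com" in host:
        continue  # skip cached copies entirely
    if host in seen_domains:
        continue  # dedupe by unique domain
    seen_domains.add(host)
    clean.append(url)

print(clean)  # ['http://example.com/page']
```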
@derdor - Completely agreed. We plan on monitoring certain popular footprints (e.g. "powered by wordpress") and seeing how often the search results update when using a handful of keywords with those footprints. Right now we're seeing that around 3-10 days will be a good number to start off with for the refresh rate.
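For anyone who wants to sanity-check a refresh interval like that on their own lists, a simple approach is to compare two scrapes of the same footprint+keyword taken a few days apart and measure how much of the result set changed; a toy sketch, with invented URL sets:

```python
# Compare two scrapes of the same query taken a week apart and measure churn.
# The URL sets below are invented purely to show the arithmetic.
scrape_day_0 = {"http://a.com/1", "http://b.com/2", "http://c.com/3"}
scrape_day_7 = {"http://a.com/1", "http://d.com/4", "http://c.com/3"}

new_urls = scrape_day_7 - scrape_day_0
churn = len(new_urls) / len(scrape_day_7)
print(f"{len(new_urls)} new URLs, {churn:.0%} of the result set changed in 7 days")
```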
@Ferryman - Regarding reverse proxies, of course you can scrape all day long with those, but $30 gets you like 10 ports... once we do some case studies of certain thread amounts, we will have a good idea of what we will need to charge to be competitive. 10 ports, or 1000 ports, it'll all scale pretty evenly, so the number of links you can scrape in a month using that route will be less than the number of links you can get with us at the same price. Also note, we see our service as a little more premium (but won't charge for it) due to the fact it can auto-scrape 24/7 and auto-FTP, something the other software cannot do.
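As a side note on the auto-FTP feature mentioned above, the general idea (automatically pushing a finished scrape file to your own server) looks roughly like this with Python's standard ftplib; the host, login, and filenames are placeholders, not anything from BanditIM's actual setup:

```python
from ftplib import FTP

# Push a finished scrape file to a remote server.
# All connection details here are placeholders for the example.
def upload_results(local_path, host, user, password, remote_name):
    with FTP(host) as ftp:
        ftp.login(user=user, passwd=password)
        with open(local_path, "rb") as fh:
            ftp.storbinary(f"STOR {remote_name}", fh)

# upload_results("scraped_urls.txt", "ftp.example.com", "user", "pass", "scraped_urls.txt")
```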
Just want to keep everyone in the loop, we haven't forgotten about you guys. With the upcoming release of our new text captcha system, the scraper has been bumped down one spot in priority, but it's very, very close! Check out the easy-to-use dashboard that'll get you scraping links 24/7 in a matter of a couple of minutes:
@banditIM After trying your email service and spin service (too bad you removed it), I'm waiting to get my hands on this. I always had trouble with scraping and proxies, so this will be the better option for me. On a completely unrelated note, the ads in the screenshots show a South Indian actress.
Comments
Scraping lists is always a very time- and resource-consuming process, so this service could make a real impact!
@sumusiko - Sweet, great to hear!
@Peisithanatos - Great man, thanks for the interest!