Maybe you could also check the HTTP status of the scraped lists. I noticed that scraping alone is not the hard part, so maybe you can add that to your list of ideas: everything that comes back with a 3xx, 4xx, or 5xx response gets deleted or saved somewhere else, and everything with a 2xx response gets saved and delivered to the customer.
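Something like this minimal Python sketch (just my illustration, using the requests library; a real version would want concurrency and probably a GET fallback for servers that reject HEAD):

```python
import requests

def filter_urls(urls, timeout=10):
    """Split scraped URLs into live (2xx) and dead (everything else)."""
    live, dead = [], []
    for url in urls:
        try:
            # HEAD is cheaper than GET; redirects stay off so a 3xx
            # shows up as-is instead of being followed to its target.
            resp = requests.head(url, timeout=timeout, allow_redirects=False)
            (live if 200 <= resp.status_code < 300 else dead).append(url)
        except requests.RequestException:
            # DNS failures, timeouts, refused connections count as dead too
            dead.append(url)
    return live, dead

live, dead = filter_urls(["http://example.com/", "http://example.com/missing"])
```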
And in addition to the FTP upload, could you integrate a Dropbox uploader as well? The customer would give you access to one specific Dropbox folder, and the system would then upload the scraped lists into that folder (a rough sketch of what I mean is below). Thanks. Looking forward to this service.
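For the upload step itself, a rough sketch using the official dropbox Python SDK (the access token, folder, and file names here are made-up placeholders):

```python
import dropbox

# Hypothetical placeholder: the customer would generate a token scoped
# to the one shared folder and hand it over once.
dbx = dropbox.Dropbox("ACCESS_TOKEN")

with open("scraped_list.txt", "rb") as f:
    # WriteMode.overwrite replaces any previous upload of the same list
    dbx.files_upload(
        f.read(),
        "/scraped-lists/scraped_list.txt",
        mode=dropbox.files.WriteMode.overwrite,
    )
```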
From my point of view, users of your system are buying harvested lists, because of the cache function. Your system will scrape many links, store them in the cache, and hand the results to everyone who asks for the same scrape. How is that different from 100 guys paying one guy to scrape targets? I don't see any difference. With your service, users are sharing results, keywords, and footprints (it takes long days to find good ones) and have to pay money for it. Just my 2 cents.
@satyr85 - You are somewhat correct, but users aren't actually "sharing" any of these things. The only way a user can get a cached result returned to them is if they use a keyword + footprint that matches a cached result. Meaning they still need to provide the keyword + footprint to scrape with, so nobody is stealing from other people's hard-earned footprint lists.
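Roughly speaking, you can think of the cache as keyed on the exact footprint + keyword pair. A toy sketch of that idea (an illustration only, not our actual implementation):

```python
import hashlib

class ScrapeCache:
    """Toy cache: results only come back for an exact footprint+keyword match."""

    def __init__(self):
        self._store = {}

    def _key(self, footprint, keyword):
        raw = f"{footprint.strip().lower()}|{keyword.strip().lower()}"
        return hashlib.sha256(raw.encode()).hexdigest()

    def get(self, footprint, keyword):
        # A user who never supplies this footprint+keyword pair
        # can never pull these results out of the cache.
        return self._store.get(self._key(footprint, keyword))

    def put(self, footprint, keyword, urls):
        self._store[self._key(footprint, keyword)] = urls
```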
Regarding a bunch of people paying one guy to scrape: in a nutshell, you are correct. Where we bring value is in all the extra processing we do, and we do it all with automation. If you went that route, you would not only have to send your footprints to that one scrape guy, but would also have to receive the results back from him each day. Also, at the speeds we are scraping, combined with our cache system, no one person will be able to return as many links as fast as we can.
@BanditIM I like these ideas. However, it would be great if there were also an option to automatically identify the platforms and build a site list for direct GSA SER import.
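For what it's worth, platform detection could be as simple as signature matching. A Python sketch of the idea (the signatures and output file names are made up; GSA SER would then import the resulting .txt files as plain URL lists):

```python
import re
import requests

# Hypothetical signature table: a real system would need far more
# engines and more robust detection than a substring match.
SIGNATURES = {
    "wordpress": re.compile(r"wp-content|wp-login\.php", re.I),
    "drupal": re.compile(r"Drupal\.settings|sites/default/files", re.I),
    "phpbb": re.compile(r"phpBB", re.I),
}

def identify_platform(url, timeout=10):
    """Fetch a page and return the first platform whose signature matches."""
    try:
        html = requests.get(url, timeout=timeout).text
    except requests.RequestException:
        return None
    for platform, pattern in SIGNATURES.items():
        if pattern.search(html):
            return platform
    return None

def write_site_lists(urls):
    """Write one plain .txt of URLs per detected platform for import."""
    buckets = {}
    for url in urls:
        platform = identify_platform(url)
        if platform:
            buckets.setdefault(platform, []).append(url)
    for platform, matched in buckets.items():
        with open(f"sitelist_{platform}.txt", "w") as f:
            f.write("\n".join(matched))
```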