All right - once the first people have got the program and hopefully have activated it successfully, we will first approach potential bugs and try to correct those...
- Next step is to enhance tab 4 'Various Tools', add more features + enable the progress bar.
Then we move on and create the article-scraper (some parts are done already).
I also have a plan to develop a 'headless submitter' - i.e. a feature to post to some selected targets.
Why? Well I think it's possible to approach some platforms differently than GSA SER does.
We'll do some testing 'on the side' and figure out if this indeed is possible.
This is a process, an adventure - and over time I think it will be possible to create something really cool;)
What you are saying is that the 'traditional' way of 'getting articles' is overused, resulting in way too similar content and duplicates... Also because most scrapers are using the same sources...
So, in order to prevent making the same mistakes - a new approach is needed.
How?
1. By feeding the program with 'special footprints' - combined with 'targeted keywords' - avoiding directories.
That would involve making use of a search engine like Google or another engine...
Grab results and filter out the 'bad directories'.
Scrape and deliver content after removing source url and copyright stuff....
Example: searching Google with simple footprints - perhaps adding some additional filters, like a required minimum length for the article content etc. (a rough sketch of the idea follows below)...
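A minimal sketch of that idea, just to illustrate it - the footprints, the directory blacklist and the helper names here are my own assumptions, not the tool's actual code:

```python
# Hypothetical sketch: pair footprints with keywords into search queries,
# drop results from known article directories, keep only long enough content.

FOOTPRINTS = ['"submit your article"', 'inurl:articles']        # example footprints
BAD_DIRECTORIES = ("ezinearticles.com", "articlesbase.com")     # assumed blacklist

def build_queries(footprints, keywords):
    """Combine every footprint with every keyword into one search string."""
    return [f"{fp} {kw}" for fp in footprints for kw in keywords]

def filter_results(urls):
    """Drop search results that point at the blacklisted directories."""
    return [u for u in urls if not any(bad in u for bad in BAD_DIRECTORIES)]

def long_enough(article_text, min_words=300):
    """Optional extra filter: require a minimum article length."""
    return len(article_text.split()) >= min_words

queries = build_queries(FOOTPRINTS, ["dog training", "golf swing"])
# The queries would then be sent to Google (or another engine), and the
# returned urls passed through filter_results() before scraping.
```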
Semi Automatic:
*The above results could be listed as 'clickable' in a panel, and if the content is okay, the user can select it [x]
When all targets are selected - finally scrape everything and write out articles....
*Perhaps running different sequences with different footprints and keywords, and present search results prior to scraping everything....
Well, as you will feed the program with a 'known list of targets' - there is no need to open any browser during the process.
It will work similarly to the 'URL Key Word Scraper' - using multiple threads to extract the articles.
That will speed up the process considerably.
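A rough sketch of how that multi-threaded extraction could look - requests/BeautifulSoup and the helper names are my assumptions, the tool itself may do it differently:

```python
import concurrent.futures

import requests
from bs4 import BeautifulSoup

def extract_article(url):
    """Fetch one target url and return its visible text, or None on failure."""
    try:
        resp = requests.get(url, timeout=15)
        resp.raise_for_status()
        soup = BeautifulSoup(resp.text, "html.parser")
        return soup.get_text(separator="\n", strip=True)
    except requests.RequestException:
        return None

def extract_all(urls, threads=10):
    """Extract every target in parallel using a pool of worker threads."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=threads) as pool:
        return list(pool.map(extract_article, urls))
```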
During the weekend I plan to start doing some more testing and coding of this feature, and as usual I will update the thread with implementation details/test results etc.
Simply because I need some feedback on how it works on a large number of targets..
And there is 1 more thing to consider too - before adding this last feature to the article extractor.
Performance - depending on the amount of targets, the removal of different things before writing the text-files could take some time. However, it can be done;)
I just need to see how it works for you guys in 'real life' first, before adding advanced 'tweaks':D
So, I suggest finishing up the remaining stuff and simply releasing v. 1.2 for you guys to try, then we take it from there..
PS:
If there should be an url present in some of the articles - it's not complicated to generate a random url as a replacement;)
Pick one randomly ----> now replace the existing one with the random pick... (easy to implement later)
- Prepared code to handle replacement of existing urls in scraped text, before making the final text-files.
Here is a little demo (please note - it will not be implemented before version 1.21)
Scraped demo text:
When we entered our core range of Aurelia Probiotic Skincare products in to the<p><a href='http://www.aureliaskincare.com/aurelia-tv/'><b>example</b></a> link.</p>Bible to be tested only a few weeks after our launch in January 2013, we could only dream of seeing one of our products in the final published book.
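A little sketch of how that replacement could work on the demo text above - the regex and the replacement list are assumptions for illustration only, not the prepared code:

```python
import random
import re

REPLACEMENT_URLS = ["http://example.org", "http://example.net"]   # assumed user-loaded list
URL_PATTERN = re.compile(r"""https?://[^\s'"<>]+""")

def replace_urls(text, replacements=REPLACEMENT_URLS):
    """Swap every url found in the scraped text for a randomly picked replacement."""
    return URL_PATTERN.sub(lambda m: random.choice(replacements), text)

demo = "...in to the<p><a href='http://www.aureliaskincare.com/aurelia-tv/'><b>example</b></a> link.</p>Bible..."
print(replace_urls(demo))   # the aureliaskincare.com url is now one of the replacements
```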
No need to wait buddy - I need feedback first from version 1.2.
Please test the article extractor with at least 50-100 urls that you yourself have located up front.
- I need to see how it goes for you guys first, before we add the remaining stuff, like copyright removal and url replacement.
Update - Version 1.2 will be released today:
Changelog:
Current donors:
Patience:D I will send a PM to you guys with the new release.
Everyone else:
Please consider joining this adventure, as the development is based purely on interest, support and donations. I don't make money on this project - it doesn't even cover the electricity;)
Great to hear it worked fine to upgrade to the new version:)
Of course I knew it would lead to issues and problems - that is why I have delayed the rest of the features, like copyright removal and url-replacement:P
Let's break it down:
1. The Article Extractor does not use any footprints - it relies completely on the target urls the user loads into the program.
- The point is that the user himself needs to do some research up front, do a manual search in Google using various footprints, and then select the good ones...
That could be done automatically too - but it is not implemented.
Also note that this would lead to poor quality, as the program won't care whether an article is 'good' or 'bad'...
2. Threads
Yep, you are right here - the program is currently set to use all 30 threads by default.
Of course I also knew that as well;)
It needs to count the amount of targets first:
- If 5 urls are loaded - 1 thread would be enough
- 100 urls - 10 threads would do
And so on....
Not a big issue really - and easy to implement.
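Something like this trivial sketch would do - the exact ratio (roughly one thread per ten targets, capped at the current maximum of 30) is my assumption, not the final behaviour:

```python
def pick_thread_count(url_count, max_threads=30):
    """Scale the worker threads with the number of loaded urls."""
    if url_count <= 5:
        return 1
    return min(max_threads, max(1, url_count // 10))

print(pick_thread_count(5))     # -> 1
print(pick_thread_count(100))   # -> 10
print(pick_thread_count(1000))  # -> 30 (capped at the default maximum)
```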
3. Stop Button
Indeed - there is no stop button (yet:P)
- Also no button for loading in replacement urls
As I said - Those features will come in version 1.21;)
The important thing here was to test whether the 'Article Extractor' indeed works in real life.
And as far as I can see - it does exactly what it is supposed to do (ignoring the features below)
To sum up:
- Balance thread use
- Stop button
- Load replacement urls
- Implementation of url replacement, copyright removal etc...
*Edit
In terms of 'strange results' like text-files with nonsense, it will fail on some targets (different encodings and stuff).
However I think most will work, and your test result with 525 articles out of 590 seems decent.
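For what it's worth, one way to cut down on those nonsense text-files is to let the HTTP library guess the encoding when the server does not declare a usable one - a sketch of the idea, not necessarily what the tool does:

```python
import requests

def fetch_text(url):
    """Fetch a page, falling back to the detected encoding when the declared one is missing or bogus."""
    resp = requests.get(url, timeout=15)
    if not resp.encoding or resp.encoding.lower() == "iso-8859-1":
        # Many servers omit the charset; requests then defaults to ISO-8859-1,
        # which turns UTF-8 pages into garbage. Use the detected encoding instead.
        resp.encoding = resp.apparent_encoding
    return resp.text
```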
Wow great job! You took it upon yourself to create this essential function. I barely had time to react because you're fast. Thanks for giving it a shot.
- After the release of version 1.3, the focus will be on 3 things:
1. The Article Extractor (Will get some additional features)
2. Some minor fixes and tweaks.
Number 3 is actually not a part of Scraping Tool-box itself - but something new:
An experimental add-on for GSA SER - a special add-on that can submit differently than GSA SER itself.
Codename: Sentinel
It will be able to 'feed' GSA SER with submitted links, where GSA will take over and handle the remaining steps.
That means GSA SER will do the rest and add verified links etc., like it does now.
As it is experimental, the platforms it can submit to will be limited in the beginning - however, if things work out great, it will be expanded over time.
Comments
I mean, I found a website where your software can scrape articles easily.
How to get your software? PM sent. Still waiting for your response.
I will get updates, right?
Then I wanna buy now..
Thanks
PM me.