All right - once the first people have got the program and hopefully have activated it successfully, we will first tackle potential bugs and try to correct them...
- The next step is to enhance tab 4 'Various Tools', add more features and enable the progress bar.
Then we move on and create the article-scraper (some parts are done already).
I also have a plan to develop a 'headless submitter' - i.e. a feature to post to some selected targets.
Why? Well I think it's possible to approach some platforms differently than GSA SER does.
We'll do some testing 'on the side' and figure out if this is indeed possible.
This is a process, an adventure - and over time I think it will be possible to create something really cool;)
What you are saying is that the 'traditional' way of 'getting articles' is overused, resulting in far too similar content and duplicates... Also because most scrapers are using the same sources...
So, in order to avoid making the same mistakes, a new approach is needed.
How?
1. By feeding the program 'special footprints' combined with 'targeted keywords', while avoiding directories.
That would involve making use of a search engine like Google or another search engine...
Grab results and filter out the 'bad directories'.
Scrape and deliver the content after removing the source URL and copyright stuff...
Example searching Google with simple footprints:
Perhaps adding some additional filters, like defining the required length of the article content etc. (a rough sketch of the query building and filtering follows below)...
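A minimal sketch of how the footprint + keyword queries and the 'bad directory' filter might look - the footprints, keywords and blacklist below are hypothetical placeholders, not the real ones the program would use:

# Sketch of building search queries from footprints + keywords and
# filtering out known article directories. All values are placeholders.
from urllib.parse import urlparse

FOOTPRINTS = ['"sample footprint one"', '"sample footprint two"']  # hypothetical
KEYWORDS = ["dog training", "skin care"]                           # hypothetical
BAD_DIRECTORIES = {"ezinearticles.com", "articlebase.com"}         # hypothetical blacklist

def build_queries(footprints, keywords):
    """Combine every footprint with every keyword into one search query."""
    return [f"{fp} {kw}" for fp in footprints for kw in keywords]

def filter_results(urls, blacklist):
    """Drop results whose domain is on the directory blacklist."""
    kept = []
    for url in urls:
        domain = urlparse(url).netloc.lower().removeprefix("www.")
        if domain not in blacklist:
            kept.append(url)
    return kept

if __name__ == "__main__":
    for q in build_queries(FOOTPRINTS, KEYWORDS):
        print("search:", q)
    sample = ["http://ezinearticles.com/?id=1", "http://example-blog.com/post"]
    print(filter_results(sample, BAD_DIRECTORIES))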
Semi-Automatic:
* The above results could be listed as 'clickable' in a panel, and if the content is okay, the user can select it [x].
When all targets are selected, finally scrape everything and write out the articles...
* Perhaps run different sequences with different footprints and keywords, and present the search results prior to scraping everything (see the selection sketch below)...
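A rough sketch of that semi-automatic flow, assuming the search results are already collected; the console prompt here only stands in for the clickable [x] panel in the GUI:

# Sketch of the semi-automatic step: list the results, let the user
# tick the ones that look okay, then keep only those for scraping.
def select_targets(results):
    """Show numbered results and return only the ones the user marks."""
    for i, url in enumerate(results, start=1):
        print(f"[{i}] {url}")
    picked = input("Numbers to keep (comma separated): ")
    chosen = set()
    for part in picked.split(","):
        part = part.strip()
        if part.isdigit() and 1 <= int(part) <= len(results):
            chosen.add(int(part))
    return [results[i - 1] for i in sorted(chosen)]

if __name__ == "__main__":
    demo_results = ["http://example-blog.com/post-1", "http://another-site.com/article"]
    targets = select_targets(demo_results)
    print("Will scrape:", targets)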
Well, as you will feed the program a 'known list of targets', there is no need to open any browser during the process.
It will work similarly to the 'URL Key Word Scraper', using multiple threads to extract the articles.
That will speed up the process considerably.
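A minimal sketch, assuming Python with a thread pool and the requests library, of how a known list of targets could be fetched in parallel - only the download step is shown; the actual article extraction and cleanup would follow per page:

# Sketch of fetching a known list of targets with multiple threads.
from concurrent.futures import ThreadPoolExecutor, as_completed
import requests

def fetch(url):
    """Download one target and return (url, html or None on failure)."""
    try:
        resp = requests.get(url, timeout=10)
        resp.raise_for_status()
        return url, resp.text
    except requests.RequestException:
        return url, None

def fetch_all(urls, workers=10):
    """Fetch all targets concurrently and collect the successful ones."""
    pages = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(fetch, u) for u in urls]
        for fut in as_completed(futures):
            url, html = fut.result()
            if html is not None:
                pages[url] = html
    return pages

if __name__ == "__main__":
    targets = ["http://example-blog.com/post-1", "http://another-site.com/article"]
    pages = fetch_all(targets)
    print(f"Fetched {len(pages)} of {len(targets)} targets")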
During the weekend I plan to start doing some more testing and coding of this feature, and as usual I will update the thread with implementation details, test results etc.
Simply because I need some feedback on how it works on a large number of targets...
And there is one more thing to consider before adding this last feature to the article extractor.
Performance - depending on the number of targets, removing the different things before writing the text files could take some time. However, it can be done;)
I just need to see how it works for you guys in 'real life' first, before adding advanced 'tweaks' :D
So, I suggest finishing up the remaining stuff and simply releasing v. 1.2 for you guys to try; then we take it from there...
PS:
If there should be a URL present in some of the articles, it's not complicated to generate a random URL as a replacement;)
Pick one randomly ----> now replace the existing one with the random one... (Easy to implement later)
- Prepared code to handle replacement of existing URLs in the scraped text, before making the final text files.
Here is a little demo (please note - it will not be implemented before version 1.21)
Scraped demo text:
When we entered our core range of Aurelia Probiotic Skincare products in to the<p><a href='http://www.aureliaskincare.com/aurelia-tv/'><b>example</b></a> link.</p>Bible to be tested only a few weeks after our launch in January 2013, we could only dream of seeing one of our products in the final published book.
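A small sketch of the idea, assuming a hypothetical pool of replacement URLs - not the actual prepared code mentioned above, just an illustration of swapping the link in the demo text above for a random one before the final text file is written:

# Sketch of replacing any URL found in scraped text with a randomly
# picked replacement. The replacement pool is a placeholder.
import random
import re

REPLACEMENT_URLS = ["http://example.com/", "http://example.org/"]  # hypothetical pool
URL_PATTERN = re.compile(r"https?://[^\s'\"<>]+")

def replace_urls(text, pool):
    """Swap every URL in the text for one picked at random from the pool."""
    return URL_PATTERN.sub(lambda match: random.choice(pool), text)

if __name__ == "__main__":
    scraped = ("When we entered our core range of Aurelia Probiotic Skincare products "
               "in to the<p><a href='http://www.aureliaskincare.com/aurelia-tv/'>"
               "<b>example</b></a> link.</p>Bible to be tested only a few weeks after "
               "our launch in January 2013, we could only dream of seeing one of our "
               "products in the final published book.")
    print(replace_urls(scraped, REPLACEMENT_URLS))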
Comments
I mean, I found a website where your software can scrape articles easily.