
Keyword Generator - Made in Java - For Scraping


Comments

  • magically
    edited April 2015
    All right - once the first users have got the program and hopefully have activated it successfully, we will first approach potential bugs and try to correct them...

    - Next step is to enhance tab 4 'Various Tools' and add more features + enable the progress bar.
    Then we move on and create the article scraper (some parts are done already).

    I also have a plan to develop a 'headless submitter' - i.e. a feature to post to some selected targets.
    Why? Well, I think it's possible to approach some platforms differently than GSA SER does.
    We'll do some testing 'on the side' and figure out if this is indeed possible.

    This is a process, an adventure - and over time I think it will be possible to create something really cool;)

  • magically
    edited April 2015
    Ohh I forgot in the previous message...

    Here is another Book in Spanish to get you going:


    NB: Make sure to select the proper sorting algorithm ----> Latin, before hitting the 'Go' button

    image
  • magically
    edited April 2015
    Work in progress for next release:

    - Split file (large file)

    I have prepared the code for 'Split File' - which will be added as a feature in the next release.

    Early raw demo - without implementation in the GUI:

    In this sample a file of 349MB is split into 4 parts. The source file could have been significantly larger - however, this is just a sample/demo.

    image

    image
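    For anyone curious, the splitting itself boils down to something like the sketch below - a simplified, byte-based illustration rather than the tool's actual code (the file name and part count are made up):

        import java.io.*;

        public class FileSplitter {

            // Split 'source' into 'parts' pieces of roughly equal size,
            // written next to the original as .part1, .part2, ...
            public static void split(File source, int parts) throws IOException {
                long partSize = (source.length() + parts - 1) / parts; // round up
                byte[] buffer = new byte[8192];

                try (InputStream in = new BufferedInputStream(new FileInputStream(source))) {
                    for (int i = 1; i <= parts; i++) {
                        File partFile = new File(source.getPath() + ".part" + i);
                        try (OutputStream out = new BufferedOutputStream(new FileOutputStream(partFile))) {
                            long written = 0;
                            int read;
                            while (written < partSize
                                    && (read = in.read(buffer, 0,
                                            (int) Math.min(buffer.length, partSize - written))) != -1) {
                                out.write(buffer, 0, read);
                                written += read;
                            }
                        }
                    }
                }
            }

            public static void main(String[] args) throws IOException {
                split(new File("bigfile.txt"), 4); // e.g. the 349MB demo file into 4 parts
            }
        }

    Note that a plain byte split can cut a line in half at a part boundary; for URL or keyword lists a line-aware split would be the safer variant.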



  • magically
    Article Scraping - Sources

    I would appreciate it if someone could suggest a list of good sources, besides ezinearticles.

    A list of 10-15 sources would be a good starting point!

    I need some sources to work on;)
  • Kaine (thebestindexer.com)
    edited April 2015
    I think it's better if you look at regular websites for articles, not article directories. Too many users scrape those, producing duplicates time after time.

    Once you see what works, simply share the footprints used to find good sources for your software :)
  • magically
    edited April 2015
    @Kaine
    Okay, let me see if I understand you correctly...

    What you are saying is that the 'traditional' way of 'getting articles' is overused, resulting in way too similar content and duplicates... also because most scrapers are using the same sources...

    So, in order to prevent making the same mistakes - a new approach is needed.

    How?

    1.
    - By feeding the program with 'special footprints' combined with 'targeted keywords' - avoiding directories.

    That would involve making use of a search engine like Google or some other search engine...
    Grab the results and filter out the 'bad directories'.
    Scrape and deliver the content after removing the source URL and copyright stuff....

    Example searching Google with simple footprints:

    image

    Perhaps adding some additional filters, like defining a required length of the article content etc...

    Semi Automatic:

    *The above results could be listed as 'clickable' in a panel, and if the content is okay, the user can select it [x]

    When all targets are selected - finally scrape everything and write out the articles....

    *Perhaps run different sequences with different footprints and keywords, and present the search results prior to scraping everything (see the sketch below)....


    2. Other suggestions or strategies are welcome
     - Please add suggestions, strategies;)
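    For illustration, the combination step in point 1 is just a pairing of every footprint with every keyword - a simplified sketch (the names are made up, not the tool's actual code):

        import java.util.ArrayList;
        import java.util.Arrays;
        import java.util.List;

        public class QueryBuilder {

            // Pair every footprint with every keyword to form one search query each.
            static List<String> buildQueries(List<String> footprints, List<String> keywords) {
                List<String> queries = new ArrayList<>();
                for (String footprint : footprints) {
                    for (String keyword : keywords) {
                        queries.add(footprint + " " + keyword);
                    }
                }
                return queries;
            }

            public static void main(String[] args) {
                List<String> footprints = Arrays.asList("\"blog\"", "site:wordpress.com +");
                List<String> keywords = Arrays.asList("skin care", "diet");
                // prints: "blog" skin care, "blog" diet, site:wordpress.com + skin care, ...
                buildQueries(footprints, keywords).forEach(System.out::println);
            }
        }

    Each resulting query is then sent to the search engine in turn.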
  • Kaine (thebestindexer.com)
    edited April 2015
    @magically

    I mean: find websites where your software can scrape articles easily.

    For example, <article> .... </article> in HTML5.

    Then find good footprints to locate those sites on the web, and users scrape those URLs to feed into the software.

    For example, the footprints of Wicked Article Creator (article directories - not good):

    site:goarticles.com + 
    site:ezinemark.com + 
    site:examiner.com + 
    site:voices.yahoo.com + 
    site:articlebiz.com + 
    site:articletrader.com + 
    site:a1articles.com + 
    site:articlesnatch.com + 
    site:pubarticles.com + 
    site:articlealley.com + 
    site:ezinearticles.com + 
    site:buzzle.com + 
    site:selfgrowth.com + 
    site:brighthub.com + 
    site:suite101.com + 
    site:isnare.com + 
    site:articlecity.com + 
    site:articlerich.com + 
    site:ideamarketers.com + 
    site:articleslash.com + 
    site:articlepool.com + 
    site:abcarticledirectory.com + 
    site:searcharticles.net + 
    site:streetarticles.com + 
    site:articlealley.com
    site:articlecube.com + 
    site:sooperarticles.com + 
    site:bukisa.com + 
    site:infobarrel.com + 
    site:gather.com + 
    site:isnare.com + 


    Maybe:

    site:wordpress.com +
    site:blogger.com +
    .....


    might return good results.
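    Picking up the HTML5 point: grabbing the <article> element from a page is straightforward with an HTML parser. A simplified sketch using the jsoup library (an assumption - the tool may parse pages differently, and the URL is made up):

        import org.jsoup.Jsoup;
        import org.jsoup.nodes.Document;
        import org.jsoup.nodes.Element;

        public class ArticleGrabber {

            // Fetch one page and return the text of its first HTML5 <article> element.
            static String grabArticle(String url) throws Exception {
                Document doc = Jsoup.connect(url)
                        .userAgent("Mozilla/5.0") // some hosts block Java's default agent
                        .timeout(10000)
                        .get();
                Element article = doc.select("article").first(); // the HTML5 tag mentioned above
                return article != null ? article.text()          // plain text, tags stripped
                                       : doc.body().text();      // fallback: whole page body
            }

            public static void main(String[] args) throws Exception {
                System.out.println(grabArticle("http://someblog.wordpress.com/some-post/"));
            }
        }

    Pages that don't use the <article> tag fall back to the whole body here; real extraction would need smarter boilerplate removal.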
  • magically
    @Kaine
    Hmm... I think we are talking about the same thing :D

    If you look at the image above, you will see some text highlighted in green: "blog" "skin care"
    That would be the search term = footprint + keyword

    It can actually handle your suggested footprint: site:wordpress.com + diet

    -->Example: If 5 footprints + keywords are given, it will repeat the search with all footprints + keywords

    The real question would be whether the user should have a chance to view the article, and select it if it looks good.

    For instance, if 25 results are shown, the user finds 10 suitable ones and selects those.
    It will then do the job, scrape the articles and output the text files.

    - Or should that process be 100% automatic?
    -----------------------------------------------------------------------------------------------------------------------------------------

    Different Option:


    Perhaps you are suggesting to simply feed the scraper with URLs you have found up front?

    So, you would ask the scraper to load a list of targets, and simply scrape those?
    Meaning if you feed it with 50 URLs - it will scrape those and deliver the articles as output.

  • Kaine (thebestindexer.com)
    edited April 2015
    @magically

    "Perhaps you are suggesting to simply feed the scraper with URLs you have found up front?

    So, you would ask the scraper to load a list of targets, and simply scrape those?
    Meaning if you feed it with 50 URLs - it will scrape those and deliver the articles as output."

    Yes, exactly - the software only visits the URLs and grabs the articles; it doesn't scrape for URLs itself.
  • magically
    Yep, I think I got your point now :D

    Most users know how to make a decent footprint, so they can also do their own searching in Google...

    So here is the scenario:

    1. Users make their own target list (using their own footprints and keywords), example: site:wordpress.com + diet
    2. When they have collected enough information - they make the list.
    3. They load their list into the program.
    4. The program extracts all articles and writes out text files.

    In other words:

    The program must have a feature to do the following:

    - Load targets (references to articles, of course)
    - Extract the article for each loaded URL
    - Write the article to file, for each URL - without source information and copyright info.

    Is that correct?
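    Spelled out as code, that flow is roughly the following - a simplified sketch, where extractArticle() stands in for the real extraction code and the file names are made up:

        import java.io.IOException;
        import java.nio.charset.StandardCharsets;
        import java.nio.file.*;
        import java.util.List;

        public class ArticlePipeline {

            // Placeholder for the actual article extraction (see the jsoup sketch above).
            static String extractArticle(String url) {
                return "article text from " + url;
            }

            public static void main(String[] args) throws IOException {
                // Load the user's target list (one URL per line).
                List<String> targets = Files.readAllLines(
                        Paths.get("targets.txt"), StandardCharsets.UTF_8);

                // Extract each article and write one text file per URL.
                int n = 1;
                for (String url : targets) {
                    String article = extractArticle(url.trim());
                    Files.write(Paths.get("article" + n++ + ".txt"),
                            article.getBytes(StandardCharsets.UTF_8));
                }
            }
        }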
  • magically
    edited April 2015
    Update - Work in progress for next release:

    - Implementation of 'SPLIT FILE' in the GUI (see previous post above, code is ready)

    - Prepared code for 'REMOVE DUPLICATE DOMAINS' (will be implemented in the GUI)

    Obviously GSA SER does not calculate it correctly...:D

    Scraping Tool-Box removes all junk files (.pdf, .xml, .chm etc.) + removes duplicate domains in this new algorithm.

    image

    That will give users an opportunity to keep only UNIQUE URLS or remove duplicate domains.

    Comparison of cases:

    The source file contained 4,615,209 URLs.

    Remaining targets left: 724,826 (Scraping Tool-Box)
    Remaining targets left: 713,554 (GSA SER)

    How come GSA SER ends up with fewer results, considering it doesn't remove junk files during dedupe?
    @Sven Could it be a bug:))
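    For reference, the dedupe step described above amounts to something like this simplified sketch (the junk-extension list and file names are illustrative, not the tool's full logic):

        import java.io.IOException;
        import java.net.URI;
        import java.nio.charset.StandardCharsets;
        import java.nio.file.*;
        import java.util.*;

        public class DomainDedupe {

            // Extensions treated as junk - an illustrative subset.
            static final Set<String> JUNK = new HashSet<>(Arrays.asList(".pdf", ".xml", ".chm"));

            static boolean isJunk(String url) {
                String lower = url.toLowerCase();
                for (String ext : JUNK) {
                    if (lower.endsWith(ext)) return true;
                }
                return false;
            }

            public static void main(String[] args) throws IOException {
                Set<String> seenDomains = new HashSet<>();
                List<String> kept = new ArrayList<>();

                for (String line : Files.readAllLines(
                        Paths.get("scraped-urls.txt"), StandardCharsets.UTF_8)) {
                    String url = line.trim();
                    if (url.isEmpty() || isJunk(url)) continue;  // drop junk file types first
                    String host;
                    try {
                        host = URI.create(url).getHost();        // the domain part of the URL
                    } catch (IllegalArgumentException e) {
                        continue;                                // skip malformed lines
                    }
                    if (host != null && seenDomains.add(host)) { // first URL per domain wins
                        kept.add(url);
                    }
                }
                Files.write(Paths.get("unique-domains.txt"), kept, StandardCharsets.UTF_8);
            }
        }

    Details like how the domain is parsed (e.g. whether subdomains count as separate domains) are exactly the kind of thing that can explain the count difference against GSA SER.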


    - Preparation of Article Scraper and Implementation
  • magically
    @Kaine

    Here is a very basic prototype of the article extractor:

    image

    The program was loaded with a target:

    The site was visited, and during the visit the article was extracted.
    Finally, it was printed out (to the console, for demonstration).

    I hope this is something like what you had in mind?
  • Kaine (thebestindexer.com)
    edited April 2015
    @magically

    Yes, that's it :) Do you think it's possible to load multiple pages at the same time and scrape them afterwards? (So there's no waiting on loading time.)
  • magically
    @Kaine

    Great to hear:D

    Well, as you will feed the program with a 'known list of targets' - there is no need to open any browser during the process.

    It will work similarly to the 'URL Keyword Scraper' - using multiple threads to extract the articles.

    That will speed up the process considerably.

    During the weekend I plan to start doing some more testing and coding of this feature, and as usual I will update the thread during the implementation with test results etc.
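    The threading idea is the standard fixed-pool pattern - a simplified sketch (fetchArticle() stands in for the real extraction code, and the URLs and pool size are made up):

        import java.util.ArrayList;
        import java.util.Arrays;
        import java.util.List;
        import java.util.concurrent.*;

        public class ParallelExtractor {

            // Stand-in for the jsoup-style extraction sketched earlier in the thread.
            static String fetchArticle(String url) {
                return "article text from " + url;
            }

            public static void main(String[] args) throws Exception {
                List<String> targets = Arrays.asList(
                        "http://site1.example/post", "http://site2.example/post");

                ExecutorService pool = Executors.newFixedThreadPool(10); // 10 downloads in parallel
                List<Future<String>> pending = new ArrayList<>();
                for (String url : targets) {
                    pending.add(pool.submit(() -> fetchArticle(url))); // queue every page at once
                }
                for (Future<String> result : pending) {
                    System.out.println(result.get()); // block only until that page is done
                }
                pool.shutdown();
            }
        }

    Because all downloads are in flight at once, one slow page no longer holds up the rest - which answers the question above about not waiting on loading time.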
  • Thanks to @Kaine for pointing me at this thread, I hadn't seen it before. 

    @magically good work dude, this looks like a very handy tool.
  • magically
    @JudderMan
    Many thanks for your kind words, really appreciated:)
    - And indeed thanks to @Kaine as well for support, ideas and feedback.
  • magically
    - Work in progress:

    Early preview of upcoming new feature - Split File

    image

    Not completed yet, still some heavy coding left to do....

    Once this feature is fully implemented, the work of the article scraper will be initiated.
  • Can't wait to get this software. SHOW YOUR MAGIC...  :((
  • magically
    edited April 2015
    - Completed - Split File (Included in next release...)

    image

    Ability to select various units:

    image

    Selection of target file that needs to be split:

    image

    Calculation of the file size is done 'on the fly'....

    image

    Process Initiated:

    image

    Task Completed:

    image

    Result:

    image

    Moving on to the next feature - I plan to start on it during the weekend (if time allows me to do it:D):
    - Work in Progress: Article Extracter
    Stay Tuned;)
  • Kaine (thebestindexer.com)
    It's the best feature for me ^^
  • magically
    @Kaine
    Indeed buddy;)

    Very early GUI mock-up (it can still change a lot):

    image

    Will see if I can get some time during the weekend to write the code and enhance the GUI here...
    Stay tuned for progress and updates during the weekend and the upcoming week.
  • magically
    - Added Detection of File Encoding Type (Under Various Tools Tab)
    It will detect which encoding a text-file is using 'on the fly' - really blasting fast!

    It can detect the following encoding types:
    Chinese:
    ISO-2022-CN
    BIG5
    EUC-TW
    GB18030

    Cyrillic:
    ISO-8859-5
    KOI8-R
    WINDOWS-1251
    MACCYRILLIC
    IBM866
    IBM855

    Greek:
    ISO-8859-7
    WINDOWS-1253

    Hebrew:
    ISO-8859-8
    WINDOWS-1255

    Japanese:
    ISO-2022-JP
    SHIFT_JIS
    EUC-JP

    Korean:
    ISO-2022-KR
    EUC-KR

    Unicode:
    UTF-8
    UTF-16BE / UTF-16LE

    Others:
    WINDOWS-1252

    image

    image

    image
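    For the technically curious: the encoding list above matches what Mozilla's universal charset detector supports, so the sniffing can be done with the juniversalchardet library (an assumption about the exact library; the file name is made up). A minimal sketch:

        import java.io.FileInputStream;
        import java.io.IOException;
        import org.mozilla.universalchardet.UniversalDetector;

        public class EncodingSniffer {

            // Feed the file to the detector until it is confident, then ask for the name.
            static String detect(String path) throws IOException {
                UniversalDetector detector = new UniversalDetector(null);
                byte[] buf = new byte[4096];
                try (FileInputStream in = new FileInputStream(path)) {
                    int n;
                    while ((n = in.read(buf)) > 0 && !detector.isDone()) {
                        detector.handleData(buf, 0, n);
                    }
                }
                detector.dataEnd();
                return detector.getDetectedCharset(); // e.g. "UTF-8", or null if unsure
            }

            public static void main(String[] args) throws IOException {
                System.out.println(detect("keywords.txt"));
            }
        }

    Detection stops as soon as the detector is confident, which is why it feels 'blasting fast' even on big files.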

  • magically
    Update - Almost completed the Article Extractor

    A small demo:

    1. We find 3 random targets using this footprint: site:wordpress.com + skincare

    In this case, the following were picked:
    Let's hit start and see what happens...

    image

    Now we look at the destination folder:

    image
    Indeed, 3 articles have been extracted and generated:D

    Sample from article2:
    image


    To Do before release of Scraping Tool-Box 1.2

    - Minor adjustments in the GUI
    - Implementation of other minor stuff
    - Compile the program
    - Launch;)

    Expected timeframe for version 1.2:
    5-7 days

  • magically
    - Added Korean Language Support for Keyword Generator

    image

    As I can't read Korean - here is a translation:
    image


  • magically
    - Adjusted logfile:

    image
  • magically
    - Adjusted the 'Article Extractor' GUI even further:

    image

    - Please see the previous entry - Article Scraper completed
    Just needs very small fixes - and it's done;)
  • magically
    - Fixed formatting issues in 'URL Keyword Scraper'

    image

    Still to do, before release of version 1.2:
    - Minor adjustments & Enhancements
  • Kaine (thebestindexer.com)
    edited April 2015

    And does it clean the copyright? It would also be good if it could delete the article's posting date, along with the possibility to change URLs inside the article.
  • magically
    edited April 2015
    @Kaine
    hehehe:P

    Well, that will be added in v. 1.21

    Simply because I need some feedback on how it works on a large number of targets..
    And there is one more thing to consider too - before adding this last feature to the article extractor.

    Performance - depending on the number of targets, removing different things before writing the text files could take some time. However, it can be done;)


    I just need to see how it works for you guys in 'real life' first, before adding advanced 'tweaks':D

    So, I suggest finishing up the remaining stuff and simply releasing v. 1.2 for you guys to try, then we take it from there..

    PS:
    If there should be a URL present in some of the articles - it's not complicated to generate a random URL as a replacement;)

    Pick one randomly ----> now replace the existing one with the random one... (Easy to implement later)


    PPS:

    I also hope to see more guys interested here - after all, we are all in the same boat, so why not help each other;)
  • magically
    - Prepared code to handle replacement of existing URLs in scraped text, before making the final text files.

    Here is a little demo (please note - it will not be implemented before version 1.21)

    Scraped demo text:

    When we entered our core range of Aurelia Probiotic Skincare products in to the<p><a href='http://www.aureliaskincare.com/aurelia-tv/'><b>example</b></a> link.</p>Bible to be tested only a few weeks after our launch in January 2013, we could only dream of seeing one of our products in the final published book.

    image

    The demo shows: the existing URL is replaced with "http://www.SomeUrl.com/"

    Question: Will it work on any URL?
    Answer: Most likely not - however, it will cover and handle quite a lot;)
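    The replacement itself can be as simple as one regular expression - a simplified sketch (the pattern and names are my illustration, not the final code):

        import java.util.regex.Matcher;
        import java.util.regex.Pattern;

        public class UrlReplacer {

            // Catches common http(s) links; deliberately simple, so unusual URL
            // forms (bare domains, odd schemes) will slip through - which is why
            // the honest answer above is "most likely not on any URL".
            static final Pattern URL = Pattern.compile("https?://[^\\s\"'<>]+");

            static String replaceUrls(String text, String replacement) {
                return URL.matcher(text).replaceAll(Matcher.quoteReplacement(replacement));
            }

            public static void main(String[] args) {
                String scraped = "see <a href='http://www.aureliaskincare.com/aurelia-tv/'><b>example</b></a> link";
                System.out.println(replaceUrls(scraped, "http://www.SomeUrl.com/"));
                // -> see <a href='http://www.SomeUrl.com/'><b>example</b></a> link
            }
        }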