
Keyword Generator - Made in Java - For Scraping

magically
edited April 2015 in Buy / Sell / Trade
Hello everyone :)

Well, I was kind of bored :P

So I made a small tool that can generate UNIQUE keywords - something that can be used for Scrapebox and GScraper...

Nothing fancy - just plain and very simple.

It's still a kind of prototype - more development will follow (more useful tools will be added over time)



image

What it does:

It's a Java program (it can run on multiple platforms) - it will read a large text file, like a book.

Then it will build a UNIQUE keyword list from that book/source file.

No dupes - just unique keywords that can be used for scraping or other stuff...
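For the technically curious - the core idea boils down to a handful of lines of Java. This is only a minimal sketch of the concept (the file paths are hypothetical, and the tool's real code may differ):

    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.Set;
    import java.util.TreeSet;

    public class KeywordGenerator {
        public static void main(String[] args) throws IOException {
            Path source = Path.of("book.txt");          // hypothetical source file
            Path destination = Path.of("keywords.txt"); // hypothetical output file

            Set<String> unique = new TreeSet<>(); // sorted, no dupes
            for (String line : Files.readAllLines(source, StandardCharsets.UTF_8)) {
                for (String word : line.toLowerCase().split("[^a-z]+")) {
                    if (word.length() > 2) { // skip noise like "a" and "an"
                        unique.add(word);
                    }
                }
            }
            Files.write(destination, unique, StandardCharsets.UTF_8);
        }
    }

Reading the whole book into memory is fine for book-sized files; a streaming reader would be the safer choice for very large sources.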

image

As I did spend a few hours making it, a small donation of $5 for each copy will be hugely appreciated.

If you are interested in this small tool (prepare to make a small donation and receive it soonish) - feel free to send me a PM.

I expect the small tool to be ready for final launch about a week from now...

Minor adjustments need to be implemented before the final release ;)

Normally I would never approach a forum and sell something - consider this an exception, an offer for those who are interested.

Comments are welcome of course.

Comments

  • magically
    edited April 2015
    - added Execution Time for the entire operation:

    image

    In this case it took 42 seconds to generate 14,098 unique keywords, based on a book containing 361,961 words.

    That is pretty damn fast, for the record ;)
  • Good job! Keep it up mate! :)
  • magically
    @young_gooner
    Many thanks ;) More is up my sleeve - some additional tools to handle the trivial work and make things easier.

    I was really tired of looking for keywords - so I made this little thingy...

    Surely there will be some "slow" keywords in the generated lists - however, tweaks to handle such things can be implemented later.
  • Kaine
    edited April 2015
    Nice - would it be possible to grab a website's keywords (with the URL)? :)

    Could be good for working on competitors.
  • magically
    @Kaine

    Indeed a good suggestion - it will be added in future updates:
    - Add a list with site URLs
    - Grab keywords from the targets
    - Finally sort and build a final list based on the results.
  • That could be handy. Does it only work with the English language?
  • magically
    @delta_squad
    That was indeed a good question!

    Right now the actual sorting handles English words - however, you are absolutely right in your observation.

    Several different sorting algorithms must be implemented to handle various languages - the user simply selects which sorting to run on the target file.

    Not so complicated to implement - just takes some time to add support for various languages.

    And yes, I will add support for this as well;)


  • Awesome! Right now I'm using furykyle's keyword lists for scraping and I'm not going to run out of keywords any time soon but with this tool you could generate potentially huge amounts of keywords when support for other languages is added. :)
  • magically
    @delta_squad
    You are absolutely correct here - I also used furykyle's keyword lists until recently, as they cover other languages.

    However, it would be a nice addition if we were able to generate our very own keywords 'on the fly' whenever we want to. 

    On top of that, we would reduce the number of people using the very same keywords in their scrapes.

    This tool enables exactly that kind of task ;)
  • magically
    edited April 2015
    @Kaine

    Well, I actually built another prototype (even though my time is limited during Easter)...
    This little prototype demonstrates most parts of your suggestion:

    This demo scans 3 URLs and extracts their keywords.
    Results are printed out on screen - just for test purposes.


    image

    Bear in mind that this demo just runs in a prompt to show the idea...

    I think you can imagine the rest of the story
     
    - Combining those results and then sorting them...
    - Add support for proxies
    - Enhancement to be able to run multi-threaded
    - Etc...

    Could be useful for some to include such a feature in this tool ;)
  • magically
    Just some new updates on topic...

    - I changed the graphical interface a little
    - Prepared support for various sorting algorithms (via user selection)
    - Added Tab Feature (additional tools will be added)
    - Prototype of URL Keyword Extractor prepared (will be added in Tab 2)

    Upcoming Tabs:

    Clean Scrapings:
    - Remove unneeded extensions like .pdf, .xml, .mp3, .chm, .ppt and so on...
    - Ensure unique URLs for GSA

    EDU/GOV Sorter:
    - Sort all URLs - keep only .edu and .gov
    - Keep unique URLs only
    - Remove unneeded extensions like .xml, .pdf etc... (a rough sketch of this cleaning logic follows below)
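    Sketched in plain Java, the cleaning could look something like this (the extension list and file names are illustrative only - not the tool's actual code):

        import java.io.IOException;
        import java.nio.file.Files;
        import java.nio.file.Path;
        import java.util.LinkedHashSet;
        import java.util.Set;

        public class UrlCleaner {
            // illustrative "junk" extensions
            private static final String[] JUNK = {".pdf", ".xml", ".mp3", ".chm", ".ppt", ".swf"};

            public static void main(String[] args) throws IOException {
                boolean eduGovOnly = false; // the planned EDU/GOV mode switch
                Set<String> unique = new LinkedHashSet<>(); // keeps order, drops duplicate URLs
                for (String raw : Files.readAllLines(Path.of("scraped.txt"))) {
                    String url = raw.trim();
                    if (url.isEmpty() || isJunk(url.toLowerCase())) continue;
                    if (eduGovOnly && !url.toLowerCase().matches(".*\\.(edu|gov)(/.*)?")) continue;
                    unique.add(url);
                }
                Files.write(Path.of("cleaned.txt"), unique);
            }

            private static boolean isJunk(String url) {
                for (String ext : JUNK) if (url.endsWith(ext)) return true;
                return false;
            }
        }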

    image
  • Kaine
    I think you can make a good tool with that ;)
  • magically
    @Kaine
    Many thanks for your kind words :)
    Work is in full progress, and I will update this thread from time to time
  • magically
    Additional features implemented - bear in mind it's still not complete...

    Features Added:
    - Sort scrapings from scrapebox/gscraper
    - Remove all unneeded extensions like .xml, .pdf, .mp3, .swf and more...
    - Keep unique urls only (remove duplicate urls - not domain)
    - Added switch to change sorting algorithm - Mode Normal or Mode EDU/GOV


    The demo below is using 'Mode Normal' - as the actual switch is not complete yet..

    image

    Upcoming Work:

    - Final implementation of Keyword Scraper (Tab 2)
    - Final implementation of the switch for Normal versus EDU/GOV mode
    - Adjustments of GUI
    - Cleanup and test

    Later releases will include:
    - Support for different languages (Keyword Generator - Algorithms)
    - Enhancements of existing functionality
    - Add-ons

    Stay tuned for more information;)

  • magically
    Update:

    - Switch for sorting scrapings, either in "Normal Mode" or "EDU/GOV Mode", fully implemented.
    In both cases, all garbage is removed 'on the fly', leaving just a list of UNIQUE URLS as a result.

    *Garbage = .xml, .pdf, .mp3, .swf and more...
  • magically
    URL Keyword Extractor Update:

    - Enhanced the code and implemented support for 30 threads (default)

    image

    Note: Don't mind the messy output in the picture - it will be tidied up in the graphical part.


    In progress:
    - Implementation of the graphical part of this feature (GUI)
    - Cleaning up
  • Kaine
    edited April 2015
    Does it take the keyword meta or the keywords in the page?

    At this level, maybe add an article web scraper directly too...
  • magically
    @Kaine
    It currently takes the meta keywords from the target page.

    - Support for both could be implemented later too.

    And an article scraper seems interesting to add too ;)
    *Your suggestion has been added to the to-do list.
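    Just to illustrate the idea - grabbing the meta keywords of a single page can be as simple as this in Java, assuming the jsoup library (an assumption on my side; the tool's internals are not published):

        import org.jsoup.Jsoup;
        import org.jsoup.nodes.Document;

        public class MetaKeywordGrab {
            public static void main(String[] args) throws Exception {
                // hypothetical target URL
                Document doc = Jsoup.connect("https://example.com").get();
                // reads <meta name="keywords" content="..."> of the target page
                String keywords = doc.select("meta[name=keywords]").attr("content");
                System.out.println(keywords);
            }
        }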
  • magically
    Update - Early implementation of the URL Scraper Function in the Graphical Interface!

    image

    Tasks completed:

    image

    Features:
    - Using 30 threads
    - Scraping meta keywords from target URLs (list input)
    - Generates a list with UNIQUE keywords, based on the results
    - Shows log visits of target URLs in real time


    *That's it for now - taking a small break to clear the mind, so I can look at it again with fresh eyes and do some cleaning/adjustments.

    Comments are still very welcome of course - feel free to jump in;)
  • magically
    PS...

    I just took a larger sample to see if it actually works - 1000 random URLs:

    Finished all threads
    unique words : 2734
    total words : 7736
    Destination: C:\SEO2015\WillItWork.txt
    Closing Buffered Writer and finishing...

    image

    Speed was actually fast - exactly as expected;)
  • magically
    edited April 2015
    Update:

    - Adjusted formatting in Log Message

    image

    image

    *Note: Maybe add a function to trim to root - depending on the job...
  • magically
    edited April 2015
    Here is a sample of the speed - using standard footprints and some keywords generated with this Tool:

    Test was done using my home-connection and a laptop.

    image


    I'm quite sure some of you hardcore scrapers are able to pull even higher speeds with some quality footprints... Unfortunately, I'm not that good at making footprints :P

    Actually the speed was still increasing at the moment of this comment: 34,723 URLs per minute and climbing...
  • magically
    edited April 2015
    LOL - Better add the proof for you guys to see....
    image



    *Edit

    A few min later:

    image

    *Edit

    Last one - should settle it:P

    image
  • magically
    Some might wonder - how about the performance in GSA SER?

    Well let's take a look:

    Verified preview:

    image

    Performance (after cleaning the raw scrapings with this Tool)

    image

    Randomly picked message from the GSA SER log:

    image

    And some more verified - just to show scraping did go well (Final Output):

    image

    Conclusion:

    It's possible to use the new tool to generate decent keywords for scraping and to clean lists.

    Feel free to comment;)
  • magically
    - Work in progress:

    1. In the upcoming days, I intend to try to develop a simple Article Scraper that will be added to this tool-box.

    2. Additional enhancements of existing code, and cleanup.
  • magically
    Update

    - Basic implementation of URL Extractor (Small Part of Article Scraper)
    - Demo Only (Just console - Not in GUI yet)

    Extraction of all URLs on a given target:
    image
  • magically
    Update:

    - Prepared MERGE FILES functionality - merge several text files into one text file.

    Demo - Console Window:

    image

    image

    The graphical part, where the user selects files, is easy to implement!

    The above image shows files A.txt + B.txt + C.txt merged into one big file --> Merged.txt

    A great addition to the existing features in this Tool-Box for scrapings.
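    For reference, merging text files is only a few lines of plain Java - a minimal sketch matching the A/B/C demo above (file names taken from the demo; error handling omitted):

        import java.io.IOException;
        import java.nio.file.Files;
        import java.nio.file.Path;
        import java.nio.file.StandardOpenOption;

        public class MergeFiles {
            public static void main(String[] args) throws IOException {
                Path merged = Path.of("Merged.txt");
                Files.deleteIfExists(merged);
                for (String name : new String[]{"A.txt", "B.txt", "C.txt"}) {
                    // append each source file to the big output file
                    Files.write(merged, Files.readAllBytes(Path.of(name)),
                            StandardOpenOption.CREATE, StandardOpenOption.APPEND);
                }
            }
        }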

    Feel free to comment

    More to come...
  • magically
    edited April 2015
    Update:

    For better understanding - I made a quick implementation in the Graphical User Interface:

    image

    Result:

    image
  • magically
    Here is a demo of a large test.

    1. I merged several files from GScraper (Target18.txt)
    2. I used the 'Clean Scrapings' function in Scraping Tool-Box (Target18-Cleaned.txt)

    image

    Ready to load into GSA SER...
  • magically
    edited April 2015

    Update:

    Prepared algorithm to generate various Random Footprints.
    - Handy for the lazy ones - Make your tasks more random

    - Will be added under "Various Tools"

    Preview of footprint generation in console (just a few, for demonstration purposes...):

    image
    The user will be able to select an X number of footprints, which will be generated randomly.
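    The idea in code form - a small sketch of random footprint generation (the template and keyword pools are made up for illustration; the tool's real pools are not shown):

        import java.util.List;
        import java.util.Random;

        public class RandomFootprints {
            public static void main(String[] args) {
                List<String> templates = List.of(
                        "\"powered by wordpress\" %s",
                        "inurl:blog %s",
                        "intitle:forum %s");
                List<String> keywords = List.of("diet", "skin care", "seo");
                Random rnd = new Random();
                int count = 5; // the X number the user selects
                for (int i = 0; i < count; i++) {
                    String footprint = templates.get(rnd.nextInt(templates.size()));
                    String keyword = keywords.get(rnd.nextInt(keywords.size()));
                    System.out.println(String.format(footprint, keyword));
                }
            }
        }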
  • magically
    Update (despite the little interest, the development continues...)
    - Added Compression Functionality under Various Tools
    It will compress multiple files from a folder to a selected destination.
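    In Java this maps naturally onto ZipOutputStream - a minimal sketch with hypothetical paths (not the tool's actual code):

        import java.io.IOException;
        import java.nio.file.DirectoryStream;
        import java.nio.file.Files;
        import java.nio.file.Path;
        import java.util.zip.ZipEntry;
        import java.util.zip.ZipOutputStream;

        public class FolderCompressor {
            public static void main(String[] args) throws IOException {
                Path folder = Path.of("C:/SEO2015/lists");  // hypothetical source folder
                Path zip = Path.of("C:/SEO2015/lists.zip"); // hypothetical destination
                try (ZipOutputStream out = new ZipOutputStream(Files.newOutputStream(zip));
                     DirectoryStream<Path> files = Files.newDirectoryStream(folder)) {
                    for (Path file : files) {
                        if (!Files.isRegularFile(file)) continue;
                        out.putNextEntry(new ZipEntry(file.getFileName().toString()));
                        Files.copy(file, out); // stream the file into the zip entry
                        out.closeEntry();
                    }
                }
            }
        }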

    Example:

    image
    Inside the new Compressed File:

    image

    Will possibly add:
    - Decompression of files
    - Progressbar for all tasks under 'Various Tools'



  • magically
    edited April 2015
    Update:

    After some serious coding - Additional language support for the Keyword Generator part is now possible!

    Here is a raw demo:

    Source - Thai
    image



    Result - Unique Keywords Thai:
    image

    Once I have some spare time - I will implement support of various languages in the Graphical Part of the Tool Box.

    In Theory - it should be possible to cover plenty of various languages;)

    That means that 'the switches', i.e. the radio buttons, will control which language is used when generating keywords.
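    One plausible way to do this per language (a sketch only - I don't know the tool's actual algorithms) is to tokenize by Unicode block, so each radio button maps to a different pattern:

        import java.util.Set;
        import java.util.TreeSet;
        import java.util.regex.Matcher;
        import java.util.regex.Pattern;

        public class ScriptTokenizer {
            public static void main(String[] args) {
                // \p{IsThai} matches the Thai Unicode block; other blocks
                // (IsGreek, IsHiragana, ...) would back the other buttons.
                Pattern thaiWords = Pattern.compile("[\\p{IsThai}]+");
                String source = "สวัสดี โลก สวัสดี"; // tiny stand-in for a Thai book
                Set<String> unique = new TreeSet<>();
                Matcher m = thaiWords.matcher(source);
                while (m.find()) unique.add(m.group());
                unique.forEach(System.out::println); // prints each word once
            }
        }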
  • magically

    Test and preview of Japanese Language:


    image


    Result:

    image


    All good:)

    Actual implementation in the GUI (Graphical User Interface) will be handled ASAP...

    Next update should show generation from the actual program - so stay tuned;)
  • Very nice. PM me your PayPal :)
  • magically
    @spammasta
    Many thanks for your support:)

    Once it's ready for launch - I will shoot you a PM ;)

    Some minor things still need to be adjusted before its first release.

    However, we are getting close to a first release - give it a couple more days...

    The development will continue after the first release, meaning more features will be added and potential bugs will be fixed.
  • magically
    Update - Language Sorting Algorithms implemented in the GUI

    1. Demo: Thai Selection

    image

    Result:
    image
  • magically
    Update - Language Sorting Algorithms implemented in the GUI

    1. Demo: English Selection

    image

    Result:
    image
  • magically
    Update - Language Sorting Algorithms implemented in the GUI

    1. Demo: Portuguese Selection
    image

    image

    Result:
    image
  • magically
    That means the following:
    - Support for:
    English, Greek, Portuguese, Japanese and Thai implemented and working!

    Final adjustments before first release:
    Before the first version is released, some minor adjustments and bugs need to be fixed.
    No additional features will be added before release - only fixing and cleaning.
    - Expect a few days from now.

    Thread will be updated here, once first release is ready;)
    The development will continue - as a donor you are supporting the future of this software.
  • magically
    Well I couldn't resist it:P

    - Added one more multi-feature before I begin to clean up:

    image

    Clicking on the monitor will result in a pop-up showing CPU usage etc...

    image

    This new status bar will become super handy in terms of future updates!
    It will replace several progress bars and make the overall interface more streamlined...

    Future updates of the program will use this new status bar globally and notify about running tasks...
  • magically
    Initiated Cleanup and Preparation of first official launch.

    Completed tasks:
    - Added specific icon in the GUI
    - Added needed confirmation dialogs
    - Cleaned up file selectors (all now show .txt files by default)

    image


    TO DO before launch:
    - Change some listeners and their respective evaluations
    - Add program update log file
    - Adjust GUI
    - Compile final program and launch

    Expected timeframe:
    3-5 days
  • magically
    edited April 2015
    A few to-dos completed:

    image

    Change Log:

    image

    - Made a stress test with a large file (worked)
    - Adjusted a few listeners (still some to go)

    image

    - Added Multi-Functionality in 'Latin' sorting algorithm - Now it covers more languages

    To-Do:
    - Fix remaining listeners
    - Adjust GUI
    - Compile and Launch
  • Could you make a video for this scraper?
  • magically
    edited April 2015
    @redfoxseo

    Sure, however please notice the program will feature 2 different kinds of scrapers:

    1. The one you see present now ---> 'URL Keyword Scraper'

    2. Upcoming feature 'Article Scraper' ---> not present at the moment.

    The current one - i.e. the 'URL Keyword Scraper' - takes a list with an x amount of target links.

    It will then visit each target and scrape the meta keywords.

    During the process of visiting the site targets, 30 threads are used to ensure speed.

    Those meta keywords are added to an internal list...

    Finally, before writing out the result list, it removes all duplicates and ensures only unique keywords are left in the final list.
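    In Java terms the flow looks roughly like this (a sketch assuming jsoup and a fixed thread pool - the shipped code may differ):

        import org.jsoup.Jsoup;
        import java.nio.file.Files;
        import java.nio.file.Path;
        import java.util.List;
        import java.util.Set;
        import java.util.TreeSet;
        import java.util.concurrent.*;

        public class UrlKeywordScraper {
            public static void main(String[] args) throws Exception {
                List<String> targets = Files.readAllLines(Path.of("targets.txt"));
                Set<String> unique = ConcurrentHashMap.newKeySet(); // internal list, no dupes
                ExecutorService pool = Executors.newFixedThreadPool(30); // the 30 threads
                for (String url : targets) {
                    pool.submit(() -> {
                        try {
                            String kw = Jsoup.connect(url).timeout(10_000).get()
                                    .select("meta[name=keywords]").attr("content");
                            for (String k : kw.toLowerCase().split("\\s*,\\s*"))
                                if (!k.isBlank()) unique.add(k.trim());
                        } catch (Exception ignored) { /* dead or slow target */ }
                    });
                }
                pool.shutdown();
                pool.awaitTermination(30, TimeUnit.MINUTES); // "cleaning up" the threads
                Files.write(Path.of("keywords.txt"), new TreeSet<>(unique)); // sorted, unique
            }
        }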

    The other feature - the non-present one (Article Scraper) - is still under development, and it will be added to the Tool-Box once it is ready ;)

    In both cases a small video demonstration is optimal to show what is going on... and I will add them later ;)

  • magically
    edited April 2015
    Update:


    Prior to release, Registration Protection has been added:

    image

    The activated program starts up - and will no longer ask for input on the next startup:

    image

    This is to ensure that the program stays among donors only.

    That leaves very little left to do before it's released:)
  • magically
    Update:

    - Enhanced URL Keyword Scraper, so it shows:
    Processed targets in real time (green)
    Loaded targets (blue)

    image

    Compiled the program and tested it on a separate machine:
    - Shows how it will be distributed as *FINAL*

    image

    Registration and execution both went flawlessly :)

    There are 2 minor issues left that I would like to fix before calling it ready to release...

    So once I have those 2 fixed, it's ready for you guys to try out :)

  • PM me your PayPal.
  • magically
    @gsasurfer
    Thank you for your support:)

    Will pm you very soon - stay tuned;)
  • magically
    - Added use of a progress bar in the URL Keyword Scraper:

    image

    That gives a good visual of processed target URLs:
    1. Real-time view of processed targets (green)
    2. Progress bar --> overall progress

    That leaves only one more tiny thing to fix before I release it...
  • magically
    And here is a sample with more targets loaded:

    image

    Processing fast - using 30 threads...
  • magically
    All right - Ladies & Gentlemen :)

    Before I announce that Scraping Tool-Box 1.0 is ready to launch today, I would like to share some important information:

    1. This software is not sold - it is released to interested donors only (to support development).
    2. Version 1.0 is far from finished - it is a process, where version 1.0 is the initial release.
    3. Those interested may make a donation of $5 via PayPal - in return they receive the following:

    A. The current final compiled version, i.e. Scraping Tool-Box 1.0.
    B. Serial + activation key
    C. As a donor you are entitled to receive future updates of the program.
    - Make sure you inform PayPal that this is a DONATION - otherwise the payment will be rejected.


    FAQ:
    Q: Why is the URL Keyword Scraper sometimes really slow when approaching the last targets?
    A: Some sites take longer to load or respond than others. Threads are also being cleaned up.

    Q: Why is there no decompress feature?
    A: Well, it's not finished. More features, including this one, will come over time.

    Q: I can't start the application.
    A: Make sure you have the latest Java installed on your system, please visit this site:

    *Known potential bugs:
    If Notepad is installed in a different location than the default installation settings.
    Fix: Contact me, and I will remove and recompile the application.

    Last Fixes and Improvements:
    - Changed the GUI a little:

    image

    Additional Information:

    I will contact those who have contacted me via PM or in this thread.
    Please be patient - time constraints and real-life issues can slow the processing :D

    Scraping Tool-Box 1.0 is ready to release later today, and I will start contacting interested parties ASAP.
    Stay tuned - it's coming today ;)
  • Can you make a video as well, to show how it works?
  • magically
    @redfoxseo

    I will try my best and add some later:)

    In the meantime, here is a book to get you started:


    1. Download the book to a location on your harddrive.
    2. Start the program and activate the tab: Keyword Generator
    3. Select the book above as source
    4. Select destination (where the final list should be stored)
    5. Select 'English' as sorting algorithm
    6. Press 'Go' button

    - A few seconds or minutes later, depending on your machine, you will have a new fresh list with unique keywords.
  • magically
    edited April 2015
    @redfoxseo


    Demo of Keyword Generator

    More videos will follow - they're quite time-consuming to make, and I'm not a video guru :P


  • lol.. I don't have the time to read my own history... :))
  • magically
    Video 2


    I will try to add additional videos - however, it's time to contact some people soon and release it :)
    The best way is to try it out ;)

    Please note:
    Some sites are down, slow etc... it will influence the result of course...
    On top of that, threads must also be cleaned up.

    This video should give you an idea about how it works though;)
  • I understood what you are trying to say and it is related to the software - is there any trial version of this software?
  • magically
    Sorry, wrong video - one more try:

  • magically
    @redfoxseo

    Hmm... sorry, no trials, as the program is distributed via donations.

    In other words:

    $5 makes you a donor, and that entitles you to the program and updates ;)

    The donation itself is only to support the future development of the application.

    So - it's not sold; actually it's free and driven by its donors.
  • magically
    PMs have been sent out to those who either asked for it in this thread or via PM.

    Everyone is still welcome to contact me via PM or here in this thread.

    As stated before - this is just the beginning! 
    The development will continue, and more features will be added over time.

  • magically
    edited April 2015
    All right - once the first people have got the program and hopefully have activated it successfully, we will first approach potential bugs and try to correct these...

    - The next step is to enhance tab 4 'Various Tools' and add more features + enable the progress bar.
    Then we move on and create the article scraper (some parts are done already).

    I also have a plan to develop a 'headless submitter' - i.e. a feature to post on some selected targets.
    Why? Well, I think it's possible to approach some platforms differently than GSA SER does.
    We'll do some testing 'on the side' and figure out if this is indeed possible.

    This is a process, an adventure - and over time I think it will be possible to create something really cool ;)

  • magically
    edited April 2015
    Ohh I forgot in the previous message...

    Here is another Book in Spanish to get you going:


    NB: Make sure to select the proper sorting algorithm ----> Latin, before hitting the 'Go' button

    image
  • magically
    edited April 2015
    Work in progress for next release:

    - Split file (large file)

    I have prepared the code for 'Split File' - which will be added as a feature in the next release.

    Early raw demo - without implementation in the GUI:

    In this sample a file of 349 MB is split into 4 parts. The source file could have been significantly larger - however, this is just a sample/demo.
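    The underlying idea, sketched in Java (byte-based, so a part boundary can land mid-line; the paths and part count are illustrative):

        import java.io.IOException;
        import java.io.InputStream;
        import java.io.OutputStream;
        import java.nio.file.Files;
        import java.nio.file.Path;

        public class FileSplitter {
            public static void main(String[] args) throws IOException {
                Path source = Path.of("bigfile.txt"); // e.g. the 349 MB sample
                int parts = 4;
                long partSize = Files.size(source) / parts + 1;
                byte[] buf = new byte[8192];
                try (InputStream in = Files.newInputStream(source)) {
                    for (int p = 1; p <= parts; p++) {
                        long written = 0;
                        try (OutputStream out = Files.newOutputStream(Path.of("part" + p + ".txt"))) {
                            int n;
                            while (written < partSize && (n = in.read(buf, 0,
                                    (int) Math.min(buf.length, partSize - written))) != -1) {
                                out.write(buf, 0, n);
                                written += n;
                            }
                        }
                    }
                }
            }
        }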

    image

    image



  • magically
    Article Scraping - Sources

    I would appreciate it if someone could mention a list of some good sources, besides ezinearticles.

    A list of 10-15 sources would be a good starting point!

    I need some sources to work on ;)
  • Kaine
    edited April 2015
    I think it's better if you look for website articles, not directories. Too many users/duplicates, time after time.

    Once you see what is good, simply share the footprints used to find good sources for your software :)
  • magically
    edited April 2015
    @Kaine
    Okay, let me see if I understand you correctly...

    What you are saying is that the 'traditional' way of 'getting articles' is overused, resulting in way too similar content and duplicates... Also because most scrapers are using the same sources...

    So, in order to avoid making the same mistakes, a new approach is needed.

    How?

    1.
    - By feeding the program with 'special footprints', combined with 'targeted keywords', avoiding directories.

    That would involve making use of a search engine like Google or some other search engine...
    Grab the results and filter out the 'bad directories'.
    Scrape and deliver the content after removing the source URL and copyright stuff...

    Example searching Google with simple footprints:

    image

    Perhaps adding some additional filters, like defining a required length for the article content etc...

    Semi-automatic:

    *The above results could be listed as 'clickable' in a panel, and if the content is okay, the user can select it [x]

    When all targets are selected - finally scrape everything and write out the articles...

    *Perhaps run different sequences with different footprints and keywords, and present the search results prior to scraping everything...


    2. Other suggestions or strategies are welcome
     - Please add suggestions and strategies ;)
  • Kaine
    edited April 2015
    magically 

    I mean, find websites where your software can scrape articles easily.

    Example: <article> .... </article> in HTML5.

    Then find good footprints to locate those on the web, and users scrape those URLs to insert into the software.

    For example, the footprints of Wicked Article Creator (not good directories):

    site:goarticles.com + 
    site:ezinemark.com + 
    site:examiner.com + 
    site:voices.yahoo.com + 
    site:articlebiz.com + 
    site:articletrader.com + 
    site:a1articles.com + 
    site:articlesnatch.com + 
    site:pubarticles.com + 
    site:articlealley.com + 
    site:ezinearticles.com + 
    site:buzzle.com + 
    site:selfgrowth.com + 
    site:brighthub.com + 
    site:suite101.com + 
    site:isnare.com + 
    site:articlecity.com + 
    site:articlerich.com + 
    site:ideamarketers.com + 
    site:articleslash.com + 
    site:articlepool.com + 
    site:abcarticledirectory.com + 
    site:searcharticles.net + 
    site:streetarticles.com + 
    site:articlealley.com
    site:articlecube.com + 
    site:sooperarticles.com + 
    site:bukisa.com + 
    site:infobarrel.com + 
    site:gather.com + 
    site:isnare.com + 


    Maybe:

    site:wordpress.com +
    site:blogger.com +
    .....


    can return good results.
  • magically
    @Kaine
    Hmm... I think we are talking about the same thing :D

    If you look at the image above, you will see some text highlighted in green: "blog" "skin care"
    That would be the search term = footprint + keyword

    It can actually handle your suggested footprint: site:wordpress.com + diet

    --> Example: If 5 footprints + keywords are given, it will repeat the search with all footprints + keywords

    The real question would be whether the user should have a chance to view the article and, if it's found good --> select it.

    For instance, if 25 results are shown, the user finds 10 suitable and selects those.
    It will then do the job, scrape the articles and output the text files.

    - Or should that process be 100% automatic?
    -----------------------------------------------------------------------------------------------------------------------------------------

    Different Option:


    Perhaps you are suggesting simply feeding the scraper with URLs you have found up front?

    So, you would ask the scraper to load a list with targets, and simply scrape those?
    Meaning if you feed it with 50 URLs - it will scrape these and deliver the articles as output.

  • Kaine
    edited April 2015
    magically 

    "Perhaps you are suggesting to simply feed the scraper with urls you have found up front?

    So, you would ask the scraper to load a list with targets, and simply scrape those?
    Meaning if you feed it with 50 urls - it will scrape these and deliver the articles as output"

    Yes exactly, soft only visit urls and grab article, no scrape urls.
  • magically
    Yep, I think I get your point now :D

    Most users know how to make a decent footprint, so they can also do their own searching in Google...

    So here is the scenario:

    1. Users make their own target list (using their own footprints and keywords), example: site:wordpress.com + diet
    2. When they have collected enough information - they make the list.
    3. They load their list into the program.
    4. The program extracts all articles and writes out text files.

    In other words:

    The program must have a feature to do the following:

    - Load targets (referencing articles, of course)
    - Extract the article for each loaded URL
    - Write the article to a file, for each URL - without source information and copyright info.

    Is that correct?
  • magically
    edited April 2015
    Update - Work in progress for the next release:

    - Implementation of 'SPLIT FILE' in the GUI (**see above, the code is ready)

    - Prepared code for 'REMOVE DUPLICATE DOMAINS' (will be implemented in the GUI)

    Obviously GSA SER does not calculate it correctly... :D

    Scraping Tool-Box removes all junk files (.pdf, .xml, .chm etc) + removes duplicate domains in this new algorithm.

    image

    That will give users an opportunity to keep only UNIQUE URLS or remove duplicate domains.
     
    Comparison of cases:

    The source file contained 4,615,209 URLs

    Remaining targets left: 724,826 (Scraping Tool-Box)
    Remaining targets left: 713,554 (GSA SER)

    How come GSA SER has fewer results, considering it doesn't remove junk files during dedupe?
    @Sven Could it be a bug :))


    -Preparation of Article Scraper and Implementation
  • magically
    @Kaine

    Here is a very basic prototype of the article extractor:

    image

    Program was loaded with target:

    The site was visited and, during the visit, the article was extracted.
    Finally, it was printed out (to the console, for demonstration).
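    A bare-bones version of that extraction step in Java, assuming jsoup (my assumption - the prototype's code is not shown); it prefers the HTML5 <article> tag Kaine mentioned and falls back to the page body:

        import org.jsoup.Jsoup;
        import org.jsoup.nodes.Document;
        import org.jsoup.select.Elements;

        public class ArticleExtract {
            public static void main(String[] args) throws Exception {
                // hypothetical target - real runs take a user-supplied list
                Document doc = Jsoup.connect("https://example.wordpress.com/some-post").get();
                Elements articles = doc.select("article");
                String text = articles.isEmpty() ? doc.body().text() : articles.first().text();
                System.out.println(text); // console output, as in the demo
            }
        }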

    I hope this is something like what you had in mind?
  • Kaine
    edited April 2015
    magically 

    Yes, that's it :) Do you think it's possible to launch multiple pages at the same time and scrape them after? (to avoid waiting on loading times)
  • magically
    @Kaine

    Great to hear:D

    Well, as you will feed the program with a 'known list of targets' - there is no need to open any browser during the process. 

    It will work similarly to the 'URL Keyword Scraper' - using multiple threads to extract the articles.

    That will speed up the process considerably.

    During the weekend I plan to do some more testing and coding of this feature, and as usual I will update the thread with the implementation, test results etc.
  • Thanks to @Kaine for pointing me at this thread, I hadn't seen it before. 

    @magically good work dude, this looks like a very handy tool.
  • magically
    @JudderMan
    Many thanks for your kind words, really appreciated:)
    - And indeed thanks to @Kaine as well for support, ideas and feedback.
  • magically
    -Work in progress:

    Early preview of upcoming new feature - Split File

    image

    Not completed yet, still some heavy coding left to do....

    Once this feature is fully implemented, the work of the article scraper will be initiated.
  • Can't wait to get this software. SHOW YOUR MAGIC...  :((
  • magically
    edited April 2015
    - Completed - Split File (Included in next release...)

    image

    Ability to select various units:

    image

    Selection of target file that needs to be split:

    image

    File size calculation is done 'on the fly'...

    image

    Process Initiated:

    image

    Task Completed:

    image

    Result:

    image

    Moving on to the next feature - I plan to start on it during the weekend (if time allows me to :D):
    - Work in Progress: Article Extractor
    Stay tuned ;)
  • Kaine
    It's the best feature for me ^^
  • magically
    @Kaine
    Indeed buddy;)

    Very early GUI MOCK-UP (it can still change a lot)

    image

    Will see if I can get some time during the weekend to write the code and enhance the GUI here...
    Stay tuned for progress and updates during the weekend and the upcoming week.
  • magically
    - Added Detection of File Encoding Type (Under Various Tools Tab)
    It will detect which encoding a text file is using 'on the fly' - really blazingly fast!

    It can detect the following encoding types:
    Chinese:
    ISO-2022-CN
    BIG5
    EUC-TW
    GB18030

    Cyrillic:
    ISO-8859-5
    KOI8-R
    WINDOWS-1251
    MACCYRILLIC
    IBM866
    IBM855

    Greek:
    ISO-8859-7
    WINDOWS-1253

    Hebrew:
    ISO-8859-8
    WINDOWS-1255

    Japanese:
    ISO-2022-JP
    SHIFT_JIS
    EUC-JP

    Korean:
    ISO-2022-KR
    EUC-KR

    Unicode:
    UTF-8
    UTF-16BE / UTF-16LE

    Others:
    WINDOWS-1252
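    That list of encodings happens to match what the juniversalchardet library supports, so here is a sketch assuming that library (an assumption on my part - the tool's internals are not published):

        import org.mozilla.universalchardet.UniversalDetector;
        import java.io.FileInputStream;

        public class EncodingDetect {
            public static void main(String[] args) throws Exception {
                UniversalDetector detector = new UniversalDetector(null);
                byte[] buf = new byte[4096];
                try (FileInputStream in = new FileInputStream("source.txt")) {
                    int n;
                    while ((n = in.read(buf)) > 0 && !detector.isDone()) {
                        detector.handleData(buf, 0, n); // feed bytes until confident
                    }
                }
                detector.dataEnd();
                System.out.println(detector.getDetectedCharset()); // e.g. UTF-8, or null
            }
        }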

    image

  • magically
    Update - Almost completed the Article Extractor

    A small demo:

    1. We find 3 random targets using this footprint: site:wordpress.com + skincare

    In this case, the following were picked:
    Let's hit start and see what happens...

    image

    Now we look at the destination folder:

    image
    Indeed, 3 articles have been extracted and generated :D

    Sample from article2:
    image


    To Do before release of Scraping Tool-Box 1.2

    - Minor adjustments in the GUI
    - Implementation of other minor stuff
    - Compile the program
    - Launch;)

    Expected timeframe for version 1.2:
    5-7 days

  • magically
    - Added Korean Language Support for Keyword Generator

    image

    As I can't read Korean - here is a translation:
    image


  • magically
    - Adjusted logfile:

    image
  • magically
    - Adjusted the 'Article Extractor' GUI even further:

    image

    - Please see the previous entry - Article Scraper completed
    Just needs very small fixes - and it's done;)
  • magically
    - Fixed formatting issues in 'URL Keyword Scraper'

    image

    Still to do, before release of version 1.2:
    - Minor adjustments & Enhancements
  • Kaine
    edited April 2015

    And does it clean the copyright? Maybe it would be good if it could also delete the time the article was posted, with the possibility to change the URLs in the article.
  • magically
    edited April 2015
    @Kaine
    hehehe:P

    Well, that will be added in v. 1.21

    Simply because I need some feedback on how it works on a large amount of targets...
    And there is 1 more thing to consider too - before adding this last feature to the article extractor.

    Performance - depending on the number of targets, removing various things before writing the text files could take some time. However, it can be done ;)


    I just need to see how it works for you guys in 'real life' first, before adding advanced 'tweaks' :D

    So, I suggest finishing up the remaining stuff and simply releasing v. 1.2 for you guys to try, then we take it from there...

    PS:
    If there happens to be a URL present in some of the articles - it's not complicated to generate a random URL as a replacement ;)

    Pick one randomly ----> now replace the existing one with the random one... (easy to implement later)


    PPS:

    I also hope to see more guys interested here - after all, we are all in the same boat, so why not help each other ;)
  • magically
    - Prepared code to handle replacement of existing URLs in scraped text, before making the final text files.

    Here is a little demo (please note - it will not be implemented before version 1.21)

    Scraped demo text:

    When we entered our core range of Aurelia Probiotic Skincare products in to the<p><a href='http://www.aureliaskincare.com/aurelia-tv/'><b>example</b></a> link.</p>Bible to be tested only a few weeks after our launch in January 2013, we could only dream of seeing one of our products in the final published book.

    image

    The demo shows: the existing URL is replaced with "http://www.SomeUrl.com/"

    Question: Will it work on any URL?
    Answer: Most likely not - however, it will cover and handle quite a lot ;)
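    For the curious, the replacement can be done with a regular expression - a minimal sketch (the pattern only covers href attributes, matching the demo above; quite naive compared to whatever the tool will ship):

        import java.util.List;
        import java.util.Random;
        import java.util.regex.Pattern;

        public class UrlReplacer {
            private static final Pattern HREF =
                    Pattern.compile("href\\s*=\\s*['\"][^'\"]*['\"]", Pattern.CASE_INSENSITIVE);

            public static void main(String[] args) {
                List<String> myLinks = List.of("http://www.SomeUrl.com/"); // user-supplied list
                Random rnd = new Random();
                String article = "<p><a href='http://www.aureliaskincare.com/aurelia-tv/'>example</a></p>";
                String replaced = HREF.matcher(article).replaceAll(
                        m -> "href='" + myLinks.get(rnd.nextInt(myLinks.size())) + "'");
                System.out.println(replaced); // original link swapped for one of ours
            }
        }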
  • Hi Magically

    How to get your software? PM sent. Still waiting for your response.
  • magically
    @zuluranger
    Hmm.. strange, I didn't get any PM.
    I will send you a PM now with the information ;)
  • Kaine
    magically 

    Very nice - how will the new version be delivered?
  • magically
    @Kaine

    Once version 1.2 is ready I will send a pm to everyone with the new software and a download link.

    Everyone should then download the new version, delete the old one and replace it with the new one.
    - Activation should not be necessary unless it's the first time you use the software.

    Expect version 1.2 out very soon, just need to adjust a few things.

    (Version 1.21 will include tweaks to the article scraper.)
  • Kaine
    OK, I'll wait for 1.21 :)

  • magically
    @Kaine

    No need to wait buddy - I need feedback first from version 1.2.
    Please test the article extractor with at least 50-100 URLs that you yourself have located up front.
    - I need to see how it goes for you guys first, before we add the remaining stuff, like copyright removal and URL replacement.

    Update - Version 1.2 will be released today:


    image

    Changelog:

    image
    Current donors:

    Patience:D I will send a pm to you guys with the new release.


    Everyone else:
    Please consider joining this adventure, as the development is based purely on interest, support and donations. I don't make money on this project - actually it cannot even pay for the electricity ;)

  • magically
    @Scraping Tool-Box Donors

    PMs have been sent out with the new release - enjoy and have fun :D
  • Kaine
    edited April 2015

    I have downloaded it and all is OK. Just before testing the article extractor - did you use a special footprint to scrape URLs, like: site:wordpress.com + diet?

    EDIT

    OK, just played with it for 2 min and I see you can push more threads (see 30), but that eats memory. To avoid that, write directly to the hard disk.

    I must quit the soft if I want to stop work; maybe one button for that would be good :)

    EDIT

    Tested with a footprint like site:wordpress.com + OTHER WORD.
    Scraping is very quick, and in 2 min 590 unique URLs were done.

    Of the 590, I got 525 articles downloaded (very good).

    Of those 525 articles, I have approx 14 articles like this: http://www60.zippyshare.com/v/Ob14Syzd/file.html
  • magically
    edited April 2015
    @Kaine

    Great to hear it worked fine to upgrade to the new version :)

    Of course I knew it would lead to issues and problems - that is why I have delayed the rest of the features, like copyright removal and URL replacement :P

    Let's break it down:

    1. The Article Extractor does not use any footprints - it completely relies on the target URLs the user loads into the program.

    - The point is that the user himself needs to do some research up front, do a manual search in Google using various footprints, and then select the good ones...

    That could be done automatically too - but it's not implemented.
    Also note that this would lead to poor quality, as the program won't care whether an article is 'good' or 'bad'...

    2. Threads
    Yep, you are right here - the program is currently set to use all 30 threads by default.
    Of course I knew that as well ;)
    It needs to count the number of targets first:
    - If 5 URLs are loaded - 1 thread would be enough
    - 100 URLs - 10 threads would do
    And so on...
    Not a big issue really - and easy to implement.

    3. Stop button
    Indeed - there is no stop button (yet :P)
    - Also no button for loading in replacement URLs

    As I said - Those features will come in version 1.21;)

    The important thing here was to test whether the 'Article Extractor' does indeed work in real life.
    And as far as I can see - it does exactly what it is supposed to do (ignoring the features below).

    To sum up:
    - Balance thread use
    - Stop button
    - Load replacement URLs
    - Implementation of URL replacement, copyright removal etc...

    *Edit
    In terms of 'strange results', like text files with nonsense: it will fail on some targets (different encodings and stuff).
    However, I think most will work, and your test result of 525 articles out of 590 seems decent.
  • Kaine
    edited April 2015

    Yes, I scraped site:wordpress.com + OTHER WORD with GScraper.
    For me the result is good; your soft scrapes articles very fast and the copyright seems to be removed :)
    Maybe removing emails would be good too.
  • magically
    @Kaine
    Awesome to hear buddy:)

    The remaining 'tweaks' will be added in the upcoming release + some other enhancements/features.

    For now, I just wanted to see how the Article Extractor performed 'raw', with default settings.

    I think you will see that the next release has the remaining stuff you are looking for - at least I will give it a try ;)

    Hope some other guys hanging around here on the forum will also discover this software...
    - It's 'hidden' in the sales section where many users don't look so much.
  • magically
    edited April 2015
    - Added to the to-do list:

    Implementation of Automatic Backup to DropBox

    - Will add a timer to handle the task. The user can select files via the GUI.
    Upload will be done automatically to DropBox
    (Developer note: a token must be created to prevent re-authentication)

    Examples could be: identified list, verified list etc...

    Simple demo of authentication in the console:

    image

    image

    Note: However, the Article Extractor features must be completed first + some other tweaks and enhancements.
  • magically
    edited April 2015
    - Very early and raw prototype of DropBox Connect:

    image
  • magically
    edited April 2015
    Small test that it is possible to get a connection to DropBox
    - Using a proper DropBox access code
    - Sensitive information is scrambled

    image


    TO DO:

    As we are now able to establish a connection to DropBox, some features need to be implemented:
    1. Upload of zip file
    2. Browsing feature to see the files
    3. Download of zip file

    - When this is implemented, a special function will be created to handle compressing of GSA SER project files.

    - A timer will handle the upload of GSA project files to DropBox - completely automatic.

    Please note: This is an early prototype - more to come...
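    For orientation only: with today's official Dropbox Java SDK (v2 - not what was available back then, so this is purely a sketch, not the code used here), the token-based upload looks like this:

        import com.dropbox.core.DbxRequestConfig;
        import com.dropbox.core.v2.DbxClientV2;
        import com.dropbox.core.v2.files.FileMetadata;
        import java.io.FileInputStream;
        import java.io.InputStream;

        public class DropBoxUpload {
            public static void main(String[] args) throws Exception {
                String token = "ACCESS-TOKEN"; // the stored token that avoids re-authentication
                DbxRequestConfig config = DbxRequestConfig.newBuilder("ScrapingToolBox").build();
                DbxClientV2 client = new DbxClientV2(config, token);
                try (InputStream in = new FileInputStream("Backup.zip")) {
                    FileMetadata meta = client.files()
                            .uploadBuilder("/Public/Backup.zip") // remote destination folder
                            .uploadAndFinish(in);
                    System.out.println("Uploaded: " + meta.getPathDisplay());
                }
            }
        }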
  • magically
    - GUI MOCK-UP of DropBox Auto Backup (prototype):

    image

    The DropBox Connect button initiates the image above and establishes the connection.

    I will now try to implement the functionality mentioned above.
  • magically
    Update DropBox Auto Backup:
    - Implemented Browse and select Source
    - Implemented Browse Destination (Directly browse Dropbox folders)

    A sample - shows a connected DropBox and a treeview with folders:

    image
  • magically
    WOOOHOOO;)


    image

    So - that means the following:
    - A method to pack the GSA SER project must be made
    - A timer to handle uploads must be made...

    Once those are made and tested, Scraping Tool-Box will be able to auto back up GSA SER projects ;)
  • magically
    edited April 2015
    - Prepared function to compress the entire GSA SER Project folder:

    image

    image

    Getting a little bit tired right now - so taking a break before making the rest :P

    However, we are close to a final working solution for automatic backups...

    Strange to see so little interest, considering how many were asking for such a feature.

    Known Issues:
    - Developer notes:
    - Max 100 users (more requires a public release via DropBox)
    - The file extension handling must be changed further (Zip/Rar/.SL) - which would probably mean a new API key
  • magically
    edited April 2015
    - Added Timer (Still not 100% complete)

    NB: Ignore the file that gets uploaded - it's just for demonstration.

    The interesting part is that the 'Timer' is activated and doing some compression in the background...
    Preparing the real file to upload;)

    Notice the Message Log:

    image

    Still a few things to do before a test can be done...
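    The timer part itself is standard Java - a sketch with a ScheduledExecutorService (the real implementation may use something else):

        import java.util.concurrent.Executors;
        import java.util.concurrent.ScheduledExecutorService;
        import java.util.concurrent.TimeUnit;

        public class BackupTimer {
            public static void main(String[] args) {
                ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
                timer.scheduleAtFixedRate(() -> {
                    // 1. compress the selected folder in the background
                    // 2. upload the resulting zip to DropBox
                    System.out.println("Backup tick - compress and upload...");
                }, 0, 5, TimeUnit.MINUTES); // e.g. a 5-minute interval
            }
        }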
  • magically
    edited April 2015
    - Added creation of internal storage for compression of the GSA SER project
    - Added switch to handle a specific backup of the GSA SER project only
    - Added retrieval of the GSA SER project folders

    image

    Hmm.. that leaves only a few things left to handle properly:

    - Adjustment of the timer-selection interval
    - Upload of the zip file (since it's different from txt files)
    - Minor adjustments and tweaks of the GUI

    In other words - almost complete already ;)

    Actually fucking awesome, to say the least :P
    Feel free to join the adventure anytime.
  • Wow great job! You took it upon yourself to create this essential function. I barely had time to react because you're fast. Thanks for giving it a shot.
  • magically
    edited April 2015
    @zinne
    Many thanks buddy;)

    Here we go - just finished the actual upload function - and the sucker works:D

    First I created a new DropBox folder - notice the folder name, that is very important:
    image

    Next I started the backup feature, using a considerably smaller folder to fake the process.
    I selected the ImageBurn folder - just for testing; it could have been the GSA SER project folder:

    image

    I specifically chose the Public folder I created on DropBox...

    The timer is now active and the program executes every 5 minutes, to test if files get uploaded:

    image


    BINGO!!!! It's working flawlessly ;)

    image


    Only a couple of things left to do:
    - Adjustments of the GUI and enabling selection of the 'Backup Interval'
    - Make a test with a larger 'Backup Interval' to ensure it performs correctly

    That leads to an upcoming release of Scraping Tool-Box 1.3, with Automatic Backup to DropBox...
    Isn't that just cool???
  • please PM me paypal.. wanna buy 
    :)
    thanks 
  • magically
    @akwin

    Awesome buddy:)

    Many thanks for your support - Hugely appreciated.
    I will send you a PM as soon as 1.3 is ready for release (unless you want 1.2 right now) - and that won't be very long.
    Some small things to check and some minor adjustments, then we are there.
  • magically
    edited April 2015
    - Performed test with large file - this time the Real GSA SER Project Folder:

    image

    The real file:

    image

    Showing the file indeed got uploaded:

    image

    Examination of file downloaded from Dropbox:

    image

    This concludes that the new Backup Feature is working!

    *Known Issues or Limitations:

    - During upload, progress is not shown in the progress bar.
    That is because the stream needs to be wrapped (this is a guess)...
    However, to avoid blocking, this has been postponed for now, as more testing is needed.


    Current Status: Complete
    I will compile version 1.3 very soon and release it.

    Expected timeframe: 1-3 days from now.

    - Feel free to make a donation and support the development (receive the program and updates as a donor)
    - Stay tuned - more information to come;)
  • That's OK.
    I will get updates, right?

    Then I wanna buy now..
    Thanks
    PM me. :)
  • magically
    edited April 2015
    @akwin
    Indeed you will get updates, that is correct buddy:)
    In fact, I think I will let you be the first one to try the new v 1.3 - So expect to get a pm from me later today.

    - Added 'Elapsed Time' to Automatic Backup to DropBox in the upcoming v1.3:
    That gives a better visual confirmation that the program is indeed active + the Message Log

    image
  • magically
    @akwin PM sent;)

    Scraping Tool-Box v.1.3 has been released!
    - Current donors will receive an update later today - patience please :D


    Important things to notice in terms of using DropBox Auto-Backup:

    1. 
    If you want to make a backup of the GSA SER project folder, please observe the following:
    You will have to select or manually input the location like below, where "Username" should be replaced with your name:

    C:\Users\"Username"\AppData\Roaming\GSA Search Engine Ranker

    2. 
    You will have to select two buttons, otherwise it will not work properly:
    "Upload" and "GSA Project"

    3. 
    You will need to create a folder on your DropBox named: "Public"
    - Choose that folder as remote destination

    Hope everyone will benefit from this new feature - and see you soon with more features to come on the Article Extractor;)
  • magically
    - After the release of version 1.3, the focus will be on 3 things:

    1. The Article Extractor (Will get some additional features)
    2. Some minor fixes and tweaks.

    Number 3 is actually not a part of Scraping Tool-box itself - but something new:

    An experimental add-on for GSA SER, a special add-on that can submit differently than GSA SER does.

    Codename: Sentinel

    It will be able to 'feed' GSA SER with submitted links, where GSA will take over and handle the remaining steps.

    That means GSA SER will do the rest, and add verified links etc. like it's doing now.

    As it is experimental, the platforms it can submit to will be limited in the beginning - however, if things work out great, it will be expanded over time.

    More on that topic later, stay tuned;)


  • edited April 2015
    Can you release some more vids, especially about the new features?
  • magically
    @redfox
    Surely more videos will be released, especially featuring the new features.
    In fact, I think a small collection should be kept in one place for reference.
    - Added to the to-do list.
  • magically
    - Added Google Search in Article Extractor

    It will build a list that can be loaded into the Article Extractor afterwards, in order to extract articles.

    Demonstration:
    image

    Asking for user input, like search term, number of results and destination (for the results)

    image

    Generated Results in the Message Log:

    image

    Result Text File:

    image

    That new generated list is ready to load as Targets for Article Extraction!

    Limitations:
    - Don't use operators like
    site:, inurl:, allinurl:, intext:

    Do it as shown in the demonstration ;)
  • How to download your software? Should I pay first? Please PM me
  • magically
    @shadir
    Will send a PM to you - stay tuned ;)
  • Kaine
    @magically

    Look at your PM - I sent you a message that I think could please many members ;)
  • magically
    @Kaine
    Awesome - I'm on it - will include it, stay tuned ;)
  • Kaine
    Very nice ;)

  • magically
    edited April 2015
    - New feature implementation (currently working on it)
    This is a very basic prototype - raw console output...

    BULK CHECK CF & TF

    image

    The idea is to add an x amount of URLs, and the bulk checker will return information like the above.
    Of course, some serious string manipulation is needed in order to present the data nicely!

    Like this:

    No. | URL                          | Status | CF | TF | External Backlinks | Referring Domains
    1   | https://forum.gsa-online.de/ | Found  | 39 | 31 | 131                | 22
    2   | https://forum.gsa-online.de/ | Found  | 39 | 31 | 131                | 22

    Speed and performance must be fast.
    Not an easy task - so patience guys:)

    I will work on several different projects during the upcoming 2 weeks - so whenever I have some time, I will continue working on this, and update the thread.

    One more example with a few more URLs:

    image
  • Kaine
    edited April 2015
    @magically Very fast work :)

    Maybe, regarding the string manipulation ("Of course some serious string manipulation is needed in order to present the data nicely!"),
    you could work directly with .csv?
  • magically
    @Kaine
    Unfortunately no - as that is a flash object - however, I will try to figure something out ;)

    Most important is the ability to get a list - however, the response times differ, which will make it a bit complicated.
  • magically
    - Update after some manipulation of the text:

    image
    The plan is to present the retrieved data as shown in the image...
    It will also generate a txt file with this information.

    However, much more must be done in order to make this work nicely in the GUI :P

    - Still, we are close to a working solution ;)
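    Since the plan includes a txt file users can sort themselves, a simple comma-separated output would open directly in Excel / OpenOffice - a tiny sketch with made-up rows matching the preview above:

        import java.io.IOException;
        import java.nio.file.Files;
        import java.nio.file.Path;
        import java.util.List;

        public class MetricsCsv {
            public static void main(String[] args) throws IOException {
                List<String> rows = List.of(
                        "No,URL,Status,CF,TF,ExternalBacklinks,ReferringDomains",
                        "1,https://forum.gsa-online.de/,Found,39,31,131,22");
                Files.write(Path.of("bulk-check.csv"), rows); // sortable in a spreadsheet
            }
        }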
  • Kaine
    edited April 2015

    Mmm, thinking about that - maybe it would be good if the list were sorted by best domain.

    I mean, all URLs are mixed in the output, I think?

    That way it's hard to find the best domains easily/quickly.

    Same for extracting all the good URLs without deleting the end of the row :)

    Maybe choose the minimum TF/CF wanted beforehand and extract into 2 files? One with everything, a second with only the good domains (no other information, for example).



    I think you know the TF/CF calculation to define the domains with the best authority?

    Best is around 50/50, approximately?


  • magicallymagically http://i.imgur.com/Ban0Uo4.png
    @Kaine
    Surely it can be sorted - however, I need to know the 'Sort-Term' - i.e. what defines top authority.

    Some sort of calculation is needed...

    In other words - 2 lists can be generated.
    1. Raw Results
    2. Sorted by a defined specification.
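
    If the 'Sort-Term' ends up being the TF/CF ratio mentioned above (a ratio close to 1 is usually read as a balanced, trustworthy profile - that reading is my assumption), list 2 could be produced roughly like this, reusing the hypothetical CfTfRow from the earlier sketch:

        import java.util.ArrayList;
        import java.util.Comparator;
        import java.util.List;

        public class AuthoritySort {
            // Assumption: score = Trust Flow, weighted by how close TF/CF is to 1.
            static double score(CfTfRow r) {
                if (r.cf == 0) return 0;
                double ratio = (double) r.tf / r.cf;
                return r.tf * Math.min(ratio, 1.0 / ratio); // penalise lopsided profiles
            }

            public static void main(String[] args) {
                List<CfTfRow> rows = new ArrayList<>();
                rows.add(new CfTfRow(1, "forum.joomla.org", "Found", 63, 69, 5382605, 107775));
                rows.add(new CfTfRow(2, "torosyfaenas.com.mx", "Found", 13, 0, 33, 6));

                // List 2: sorted by the defined specification, best first.
                rows.sort(Comparator.comparingDouble(AuthoritySort::score).reversed());
                rows.forEach(r -> System.out.println(r.url + " -> score " + score(r)));
            }
        }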
  • KaineKaine thebestindexer.com
    edited April 2015

    I was wondering whether your app could retrieve the verified URLs from SER, test them and re-inject only the URLs with good authority :)
    That could be great for building an optimised tier directly, live.

    I just don't know if SER can see that a project has changed in real time.


    Do you think it's possible to do that?
  • magicallymagically http://i.imgur.com/Ban0Uo4.png
    @Kaine
    It would be able to retrieve the verified URLs - BUT... GSA SER keeps a record of the files inside its own database.
    It would be possible to take that list, sort it according to specific parameters and generate a new verified list.

    But keep in mind it wouldn't have any reference back to GSA SER, unless you run those targets once more in a new project.

    Now that we are at it - I realize that in order to do really effective sorting of CF & TF, a database is needed.
    In other words, if we are going to sort the data, implementation of a database is required.
    It wouldn't change anything for the user, as the sorting is done 'behind the scenes' via SQL.
    However, the workload for me will increase:D
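
    For the curious, the 'behind the scenes' part could be as small as an embedded database - a minimal sketch assuming SQLite over JDBC (the driver choice and table layout are my assumptions, not the actual implementation):

        import java.sql.Connection;
        import java.sql.DriverManager;
        import java.sql.ResultSet;
        import java.sql.Statement;

        public class CfTfDatabaseSketch {
            public static void main(String[] args) throws Exception {
                // Requires the sqlite-jdbc driver on the classpath (assumption).
                try (Connection con = DriverManager.getConnection("jdbc:sqlite:cftf.db");
                     Statement st = con.createStatement()) {

                    st.executeUpdate("CREATE TABLE IF NOT EXISTS results " +
                            "(url TEXT, status TEXT, cf INTEGER, tf INTEGER, " +
                            " backlinks INTEGER, domains INTEGER)");
                    st.executeUpdate("INSERT INTO results VALUES " +
                            "('forum.joomla.org', 'Found', 63, 69, 5382605, 107775)");

                    // The sorting happens in SQL, invisible to the user:
                    ResultSet rs = st.executeQuery(
                            "SELECT url, cf, tf FROM results ORDER BY tf DESC, cf DESC");
                    while (rs.next()) {
                        System.out.println(rs.getString("url") + "  CF=" + rs.getInt("cf")
                                + "  TF=" + rs.getInt("tf"));
                    }
                }
            }
        }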

  • KaineKaine thebestindexer.com
    edited April 2015

    lol, I think something like .xls (Excel/OpenOffice) is enough :)


  • magicallymagically http://i.imgur.com/Ban0Uo4.png
    edited April 2015
    @Kaine
    hehehe, yeah that could also do it - users would be able to open it in Excel and sort there:P
    Actually less complicated:D

    *I will do some more work on that part over the weekend + test whether threads would be an option here in terms of retrieval.

    Actually, a new 'technology' is being used here - one that could be quite handy in the future.
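
    The simple route could look like this - write the rows as plain .csv, which Excel/OpenOffice opens and sorts directly (a minimal sketch; the file layout is my assumption):

        import java.io.IOException;
        import java.nio.file.Files;
        import java.nio.file.Paths;
        import java.util.Arrays;
        import java.util.List;

        public class CsvExportSketch {
            public static void main(String[] args) throws IOException {
                List<String> lines = Arrays.asList(
                        "URL,Status,CF,TF,Backlinks,Domains",
                        "forum.joomla.org,Found,63,69,5382605,107775",
                        "www.youtube.com,Found,80,84,171723141,358061");
                // Excel / OpenOffice will open this directly and let the user sort.
                Files.write(Paths.get("cftf-results.csv"), lines);
            }
        }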
  • magicallymagically http://i.imgur.com/Ban0Uo4.png
    edited April 2015
    - Added Replace URLs to Article Extractor
    It will randomly replace the original links with your links instead!

    image

    It loads a list with your preferred links. The original URLs will then be replaced with your links!

    In other words, if you enable 'Replace URLs', the program will replace the existing URLs with your URLs before printing out the txt-files.


    Demo of an article where all links have been replaced (click on the image to see):
    image

    Download a sample with 24 articles where links have been replaced - if links were found:
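
    A rough sketch of how such a replacement pass can work - find the links in an article with a regex and swap each one for a random entry from the user's list (the regex and names here are illustrative assumptions, not the tool's actual code):

        import java.util.Arrays;
        import java.util.List;
        import java.util.Random;
        import java.util.regex.Matcher;
        import java.util.regex.Pattern;

        public class ReplaceUrlsSketch {
            private static final Pattern URL = Pattern.compile("https?://[^\\s\"'<>)]+");
            private static final Random RND = new Random();

            // Replace every URL found in the article with a random one from 'mine'.
            static String replaceUrls(String article, List<String> mine) {
                Matcher m = URL.matcher(article);
                StringBuffer sb = new StringBuffer();
                while (m.find()) {
                    String pick = mine.get(RND.nextInt(mine.size()));
                    m.appendReplacement(sb, Matcher.quoteReplacement(pick));
                }
                m.appendTail(sb);
                return sb.toString();
            }

            public static void main(String[] args) {
                String article = "Read more at http://example.com/original-post today.";
                System.out.println(replaceUrls(article, Arrays.asList("http://my-site.com/")));
            }
        }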

  • Any news on BULK CHECK CF & TF?
  • magicallymagically http://i.imgur.com/Ban0Uo4.png
    @zuluranger

    Indeed, I will get some more work done soon;) I was quite occupied with another project the last 2 weeks.
    However, my hands are free again, meaning I will continue with the process.
  • magicallymagically http://i.imgur.com/Ban0Uo4.png
    @zuluranger

    I will try to do some work Sunday and post some updates.
    - Will add a new 'tab' with experimental features; the first one coming up will be Bulk Check.
    However, keep in mind that these features are experimental only.
  • magicallymagically http://i.imgur.com/Ban0Uo4.png
    - Experimental Features - the final curtain falls:

    The very last series of features will be experimental, starting with Bulk Check Page Rank.

    Early Preview:

    image

    image

    - Target URLs are automatically trimmed to the domain root (see the sketch after this list)
    - A multi-threaded handler takes care of the PR lookup
    - The list is finally written as a text file.
    (*As Google is very aggressive - some proxy handling must be added too!!!)

    - Bulk check of CF and TF will also be added here soon.
    - Checking proxies dead/alive will be added as well
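
    The trimming to domain root can be done with java.net.URI - a minimal sketch (assuming that is roughly how it's implemented; edge cases like missing schemes are omitted):

        import java.net.URI;
        import java.net.URISyntaxException;

        public class DomainRootSketch {
            // https://www.example.com/some/deep/page.html -> https://www.example.com/
            static String toDomainRoot(String url) throws URISyntaxException {
                URI u = new URI(url.trim());
                return u.getScheme() + "://" + u.getHost() + "/";
            }

            public static void main(String[] args) throws URISyntaxException {
                System.out.println(toDomainRoot("https://forum.gsa-online.de/discussion/123/thread"));
            }
        }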

    Once these last features are fully implemented - a final version will be released to every donor.

    The project will then be closed and discontinued, meaning no more development.

    Yes, that is correct - there was too little interest:(
  • magicallymagically http://i.imgur.com/Ban0Uo4.png
    edited May 2015
    - Added simple lookup - only 10 URLs are allowed per lookup - a 1-thread solution.

    image

    Issues:

    - Very slow and unstable
    - Loading more URLs will lead to no results

    I will try to do 2 things in order to improve the performance a little (see the sketch below).

    1. Split the lookup into 5-10 threads.
    2. Add the results in batches

    *The problem with adding more threads to do the job is severe memory usage

    Bear in mind that this feature is only experimental - we have no API to work with here.
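
    For point 1, a small fixed pool instead of one thread per URL keeps the memory usage bounded - a minimal sketch, where lookupPageRank is a hypothetical stand-in for whatever the real (API-less) lookup does:

        import java.util.Arrays;
        import java.util.List;
        import java.util.concurrent.ExecutorService;
        import java.util.concurrent.Executors;
        import java.util.concurrent.TimeUnit;

        public class PrLookupPool {
            // Hypothetical stand-in for the real Page Rank lookup.
            static int lookupPageRank(String url) {
                return 0;
            }

            public static void main(String[] args) throws InterruptedException {
                List<String> urls = Arrays.asList("http://example.com/", "http://example.org/");

                // 5-10 worker threads instead of one per URL keeps memory in check.
                ExecutorService pool = Executors.newFixedThreadPool(5);
                for (String url : urls) {
                    pool.submit(() -> System.out.println(url + " -> PR " + lookupPageRank(url)));
                }
                pool.shutdown();
                pool.awaitTermination(10, TimeUnit.MINUTES);
            }
        }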


    ***EDIT****

    - Added support for proxies during Page Rank lookup

    image


    *As always - remember to use fast Google-passed proxies!!!
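
    Routing a lookup through a proxy needs nothing exotic in Java - a minimal sketch with java.net.Proxy (the proxy host and port are placeholders for your own):

        import java.io.IOException;
        import java.net.HttpURLConnection;
        import java.net.InetSocketAddress;
        import java.net.Proxy;
        import java.net.URL;

        public class ProxyLookupSketch {
            public static void main(String[] args) throws IOException {
                // Placeholder proxy - substitute one of your own Google-passed proxies.
                Proxy proxy = new Proxy(Proxy.Type.HTTP,
                        new InetSocketAddress("127.0.0.1", 8080));

                URL url = new URL("http://example.com/");
                HttpURLConnection con = (HttpURLConnection) url.openConnection(proxy);
                con.setConnectTimeout(10_000);
                System.out.println("Response: " + con.getResponseCode());
                con.disconnect();
            }
        }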
  • KaineKaine thebestindexer.com
    Maybe anonymous proxies should be used for the bulk check of CF and TF?


  • magicallymagically http://i.imgur.com/Ban0Uo4.png
    @Kaine

    Yeah, it would be better to use proxies - absolutely.

    However, I will rewrite most of the feature for 'Bulk Check CF and TF' and probably add a database.
    Simply because it's easier to do calculations and retrievals later on.

    Will look at it later this week, when I'm in a better mood (right now I'm pissed:P)
    Not about this - it's something else;)
  • KaineKaine thebestindexer.com
    And cut to the root domain too ^^
  • magicallymagically http://i.imgur.com/Ban0Uo4.png
    @Kaine

    Yeah - I'm still alive:P

    Just went through some testing and upgrading on one of my computers...
    Testing Windows 10 Technical Preview, so it sort of delayed the remaining stuff.

    However, I will get some more work done soonish.

    I played around with some automation for improving the CTR also;)
  • magicallymagically http://i.imgur.com/Ban0Uo4.png
    edited May 2015
    Well, in order to actually get something new released - we will leave the CF&TF Checker as it is for now:

    image


    Complete Log Message:

    Starting Ghost Engine.....

    Please Wait getting data.....

    URL: www.sme.ao Status: Found Citation Flow: 21 Trust Flow: 15 External Backlinks: 58964 Referring Domains: 32
    URL: forum.joomla.org Status: Found Citation Flow: 63 Trust Flow: 69 External Backlinks: 5382605 Referring Domains: 107775
    URL: www.youtube.com Status: Found Citation Flow: 80 Trust Flow: 84 External Backlinks: 171723141 Referring Domains: 358061
    URL: kabbalahexperience.com Status: Found Citation Flow: 26 Trust Flow: 34 External Backlinks: 13 Referring Domains: 4
    URL: londonfuse.ca Status: Found Citation Flow: 29 Trust Flow: 34 External Backlinks: 171 Referring Domains: 35
    URL: www.shopify.com Status: Found Citation Flow: 58 Trust Flow: 71 External Backlinks: 21738375 Referring Domains: 46082
    URL: www.relevantmagazine.com Status: Found Citation Flow: 48 Trust Flow: 40 External Backlinks: 246480 Referring Domains: 1799
    URL: torosyfaenas.com.mx Status: Found Citation Flow: 13 Trust Flow: 0 External Backlinks: 33 Referring Domains: 6
    Please Wait getting data.....

    URL: www.relevantmagazine.com Status: Found Citation Flow: 48 Trust Flow: 40 External Backlinks: 246480 Referring Domains: 1799
    URL: bangladesheconomy.wordpress.com Status: Found Citation Flow: 17 Trust Flow: 0 External Backlinks: 30 Referring Domains: 15
    URL: sk-ester.com Status: Found Citation Flow: 29 Trust Flow: 19 External Backlinks: 6 Referring Domains: 5
    URL: www.metareklam.net Status: Found Citation Flow: 13 Trust Flow: 0 External Backlinks: 5 Referring Domains: 2
    URL: dev.stwinefrides.org.uk Status: Found Citation Flow: 17 Trust Flow: 19 External Backlinks: 12 Referring Domains: 4
    URL: www.tigerstores.co.uk Status: Found Citation Flow: 43 Trust Flow: 32 External Backlinks: 58086 Referring Domains: 517
    URL: www.ohssl.org Status: Found Citation Flow: 19 Trust Flow: 18 External Backlinks: 516 Referring Domains: 26
    URL: de-de.facebook.com Status: Found Citation Flow: 56 Trust Flow: 48 External Backlinks: 1072578 Referring Domains: 3471
    Please Wait getting data.....

    URL: zeit-zum-aufwachen.blogspot.com Status: Found Citation Flow: 14 Trust Flow: 15 External Backlinks: 852 Referring Domains: 17
    URL: www.metalogicdesign.com Status: Found Citation Flow: 29 Trust Flow: 43 External Backlinks: 2363 Referring Domains: 13
    URL: issuu.com Status: Found Citation Flow: 67 Trust Flow: 75 External Backlinks: 2439468 Referring Domains: 47179
    URL: www.blogger.com Status: Found Citation Flow: 74 Trust Flow: 81 External Backlinks: 112147899 Referring Domains: 435686
    URL: lists.clean-mx.com Status: Found Citation Flow: 0 Trust Flow: 0 External Backlinks: 0 Referring Domains: 0
    URL: issues.joomla.org Status: Found Citation Flow: 52 Trust Flow: 48 External Backlinks: 9042 Referring Domains: 453
    URL: www.upinfra.com Status: Found Citation Flow: 0 Trust Flow: 0 External Backlinks: 0 Referring Domains: 0
    URL: www.postes-restantes.be Status: Found Citation Flow: 16 Trust Flow: 21 External Backlinks: 228 Referring Domains: 5
    Please Wait getting data.....

    URL: issues.joomla.org Status: Found Citation Flow: 52 Trust Flow: 48 External Backlinks: 9042 Referring Domains: 453
    URL: www.forosdelweb.com Status: Found Citation Flow: 41 Trust Flow: 51 External Backlinks: 49740 Referring Domains: 962
    URL: uk7.valuehost.co.uk Status: Found Citation Flow: 14 Trust Flow: 3 External Backlinks: 65 Referring Domains: 3
    URL: www.feuerwehr-lunz.at Status: Found Citation Flow: 19 Trust Flow: 25 External Backlinks: 60 Referring Domains: 20
    URL: www.efr-germany.de Status: Found Citation Flow: 26 Trust Flow: 24 External Backlinks: 2387 Referring Domains: 125
    URL: surface.syr.edu Status: Found Citation Flow: 25 Trust Flow: 28 External Backlinks: 98 Referring Domains: 30
    URL: www.stmichaelsabbey.com Status: Found Citation Flow: 32 Trust Flow: 42 External Backlinks: 101 Referring Domains: 29
    URL: www.eastmanandassociates.net Status: Found Citation Flow: 12 Trust Flow: 9 External Backlinks: 3 Referring Domains: 1
    Please Wait getting data.....

    URL: en.wikipedia.org Status: Found Citation Flow: 64 Trust Flow: 76 External Backlinks: 5407848 Referring Domains: 49455
    URL: www.surf-devil.com Status: Found Citation Flow: 29 Trust Flow: 38 External Backlinks: 946 Referring Domains: 92
    URL: www.surf-devil.com Status: Found Citation Flow: 29 Trust Flow: 38 External Backlinks: 946 Referring Domains: 92
    URL: www.surf-devil.com Status: Found Citation Flow: 29 Trust Flow: 38 External Backlinks: 946 Referring Domains: 92
    URL: dev06.hubzero.org Status: MayExist Citation Flow: 0 Trust Flow: 0 External Backlinks: 30 Referring Domains: 3
    URL: wilsonrealestateinvestment.com Status: Found Citation Flow: 18 Trust Flow: 8 External Backlinks: 59 Referring Domains: 25
    URL: www.philotheamission.org Status: Found Citation Flow: 14 Trust Flow: 6 External Backlinks: 2 Referring Domains: 1
    URL: www.geotimes.ge Status: Found Citation Flow: 36 Trust Flow: 48 External Backlinks: 53515 Referring Domains: 578

    - It takes some time to perform the lookup on the target URLs, hence I suggest adding no more than 50 per run

    I will see if we should enhance this feature later with a database, to be able to actually do some calculations.
    However, in my honest opinion, that really requires a full API.
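
    Should that enhancement ever happen, the log lines above are regular enough to parse straight into records for such a database - a sketch where the pattern mirrors the log format and everything else is assumption:

        import java.util.regex.Matcher;
        import java.util.regex.Pattern;

        public class LogLineParser {
            // Mirrors the Ghost Engine log line format shown above.
            private static final Pattern LINE = Pattern.compile(
                    "URL: (\\S+) Status: (\\S+) Citation Flow: (\\d+) Trust Flow: (\\d+) " +
                    "External Backlinks: (\\d+) Referring Domains: (\\d+)");

            public static void main(String[] args) {
                String line = "URL: forum.joomla.org Status: Found Citation Flow: 63 "
                        + "Trust Flow: 69 External Backlinks: 5382605 Referring Domains: 107775";
                Matcher m = LINE.matcher(line);
                if (m.matches()) {
                    System.out.println(m.group(1) + "  CF=" + m.group(3) + "  TF=" + m.group(4)
                            + "  backlinks=" + m.group(5) + "  domains=" + m.group(6));
                }
            }
        }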
  • magicallymagically http://i.imgur.com/Ban0Uo4.png
    edited May 2015
    I will now 'fine-adjust' a few things... and prepare a release this upcoming week.

    To sum up the new features in upcoming release:

    - Added Google Search in Article Extractor (Use these results to scrape for articles)
    - Added Replace URLs in Article Extractor (Replace existing ones with your URLs)
    - Added Experimental Ghost Engine
    - Added Experimental Bulk Check Of CF & TF
    - Added Experimental Bulk Check PR

    - Wrapped the executable into an .exe file + added an icon for the .exe
    - Minor tweaks and bug fixes

    Limitations:
    - Don't use operators like
    Site:, Inurl, AllInurl, Intext in the new Google Search function

    - The bulk checkers are experimental only

    Additional information:

    Donors will receive a PM from me with the update, once it's ready - please have patience.
    Everyone else - feel free to contact me via PM.

    Estimated time to next release: 1 week
    image