
Keyword Generator - Made in Java - For Scraping


Comments

  • magically
    edited April 2015
    All right - once the first users have got the program and hopefully have activated it successfully, we will first approach potential bugs and try to correct them...

    - Next step is to enhance tab 4 'Various Tools' and add more features + enable the progress bar.
    Then we move on and create the article scraper (some parts are done already).

    I also have a plan to develop a 'headless submitter' - i.e. a feature to post to some selected targets.
    Why? Well, I think it's possible to approach some platforms differently than GSA SER does.
    We'll do some testing 'on the side' and figure out if this is indeed possible.

    This is a process, an adventure - and over time I think it will be possible to create something really cool;)

  • magically
    edited April 2015
    Ohh I forgot in the previous message...

    Here is another Book in Spanish to get you going:


    NB: Make sure to select the proper sorting algorithm ----> Latin, before hitting the 'Go' button

    image
  • magically
    edited April 2015
    Work in progress for next release:

    - Split file (large file)

    I have prepared the code for 'Split File' - which will be added as a feature in the next release.

    Early raw demo - without implementation in the GUI:

    In this sample a file of 349MB is split into 4 parts. The source file could have been significantly larger - however, this is just a sample/demo.

    image

    image
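    For anyone curious, the splitting itself boils down to something like the sketch below - a simplified, byte-based illustration rather than the tool's actual code (the file name and part count are made up):

        import java.io.*;

        public class FileSplitter {

            // Split 'source' into 'parts' pieces of roughly equal size,
            // written next to the original as .part1, .part2, ...
            public static void split(File source, int parts) throws IOException {
                long partSize = (source.length() + parts - 1) / parts; // round up
                byte[] buffer = new byte[8192];

                try (InputStream in = new BufferedInputStream(new FileInputStream(source))) {
                    for (int i = 1; i <= parts; i++) {
                        File partFile = new File(source.getPath() + ".part" + i);
                        try (OutputStream out = new BufferedOutputStream(new FileOutputStream(partFile))) {
                            long written = 0;
                            int read;
                            while (written < partSize
                                    && (read = in.read(buffer, 0,
                                            (int) Math.min(buffer.length, partSize - written))) != -1) {
                                out.write(buffer, 0, read);
                                written += read;
                            }
                        }
                    }
                }
            }

            public static void main(String[] args) throws IOException {
                split(new File("bigfile.txt"), 4); // e.g. the 349MB demo file into 4 parts
            }
        }

    Note that a plain byte split can cut a line in half at a part boundary; for URL or keyword lists a line-aware split would be the safer variant.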



  • magically
    Article Scraping - Sources

    I would appreciate it if someone could suggest a list of good sources, besides ezinearticles.

    A list of 10-15 sources would be a good starting point!

    I need some sources to work on;)
  • Kaine (thebestindexer.com)
    edited April 2015
    I think it's better if you look at regular websites for articles, not article directories. Too many users scrape those, producing duplicates time after time.

    Once you see what works, simply share the footprints used to find good sources for your software :)
  • magically
    edited April 2015
    @Kaine
    Okay, let me see if I understand you correctly...

    What you are saying is that the 'traditional' way of 'getting articles' is overused, resulting in way too similar content and duplicates... also because most scrapers are using the same sources...

    So, in order to prevent making the same mistakes - a new approach is needed.

    How?

    1.
    - By feeding the program with 'special footprints' combined with 'targeted keywords' - avoiding directories.

    That would involve making use of a search engine like Google or some other search engine...
    Grab the results and filter out the 'bad directories'.
    Scrape and deliver the content after removing the source URL and copyright stuff....

    Example searching Google with simple footprints:

    image

    Perhaps adding some additional filters, like defining a required length of the article content etc...

    Semi Automatic:

    *The above results could be listed as 'clickable' in a panel, and if the content is okay, the user can select it [x]

    When all targets are selected - finally scrape everything and write out the articles....

    *Perhaps run different sequences with different footprints and keywords, and present the search results prior to scraping everything (see the sketch below)....


    2. Other suggestions or strategies are welcome
     - Please add suggestions, strategies;)
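    For illustration, the combination step in point 1 is just a pairing of every footprint with every keyword - a simplified sketch (the names are made up, not the tool's actual code):

        import java.util.ArrayList;
        import java.util.Arrays;
        import java.util.List;

        public class QueryBuilder {

            // Pair every footprint with every keyword to form one search query each.
            static List<String> buildQueries(List<String> footprints, List<String> keywords) {
                List<String> queries = new ArrayList<>();
                for (String footprint : footprints) {
                    for (String keyword : keywords) {
                        queries.add(footprint + " " + keyword);
                    }
                }
                return queries;
            }

            public static void main(String[] args) {
                List<String> footprints = Arrays.asList("\"blog\"", "site:wordpress.com +");
                List<String> keywords = Arrays.asList("skin care", "diet");
                // prints: "blog" skin care, "blog" diet, site:wordpress.com + skin care, ...
                buildQueries(footprints, keywords).forEach(System.out::println);
            }
        }

    Each resulting query is then sent to the search engine in turn.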
  • Kaine (thebestindexer.com)
    edited April 2015
    @magically

    I mean: find websites where your software can scrape articles easily.

    For example, <article> .... </article> in HTML5.

    Then find good footprints to locate those sites on the web, and users scrape those URLs to feed into the software.

    For example, the footprints of Wicked Article Creator (article directories - not good):

    site:goarticles.com + 
    site:ezinemark.com + 
    site:examiner.com + 
    site:voices.yahoo.com + 
    site:articlebiz.com + 
    site:articletrader.com + 
    site:a1articles.com + 
    site:articlesnatch.com + 
    site:pubarticles.com + 
    site:articlealley.com + 
    site:ezinearticles.com + 
    site:buzzle.com + 
    site:selfgrowth.com + 
    site:brighthub.com + 
    site:suite101.com + 
    site:isnare.com + 
    site:articlecity.com + 
    site:articlerich.com + 
    site:ideamarketers.com + 
    site:articleslash.com + 
    site:articlepool.com + 
    site:abcarticledirectory.com + 
    site:searcharticles.net + 
    site:streetarticles.com + 
    site:articlealley.com
    site:articlecube.com + 
    site:sooperarticles.com + 
    site:bukisa.com + 
    site:infobarrel.com + 
    site:gather.com + 
    site:isnare.com + 


    Maybe:

    site:wordpress.com +
    site:blogger.com +
    .....


    might return good results.
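    Picking up the HTML5 point: grabbing the <article> element from a page is straightforward with an HTML parser. A simplified sketch using the jsoup library (an assumption - the tool may parse pages differently, and the URL is made up):

        import org.jsoup.Jsoup;
        import org.jsoup.nodes.Document;
        import org.jsoup.nodes.Element;

        public class ArticleGrabber {

            // Fetch one page and return the text of its first HTML5 <article> element.
            static String grabArticle(String url) throws Exception {
                Document doc = Jsoup.connect(url)
                        .userAgent("Mozilla/5.0") // some hosts block Java's default agent
                        .timeout(10000)
                        .get();
                Element article = doc.select("article").first(); // the HTML5 tag mentioned above
                return article != null ? article.text()          // plain text, tags stripped
                                       : doc.body().text();      // fallback: whole page body
            }

            public static void main(String[] args) throws Exception {
                System.out.println(grabArticle("http://someblog.wordpress.com/some-post/"));
            }
        }

    Pages that don't use the <article> tag fall back to the whole body here; real extraction would need smarter boilerplate removal.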
  • magically
    @Kaine
    Hmm... I think we are talking about the same thing :D

    If you look at the image above, you will see some text highlighted in green: "blog" "skin care"
    That would be the search term = footprint + keyword

    It can actually handle your suggested footprint: site:wordpress.com + diet

    -->Example: If 5 footprints + keywords are given, it will repeat the search with all footprints + keywords

    The real question would be whether the user should have a chance to view the article, and select it if it looks good.

    For instance, if 25 results are shown, the user finds 10 suitable ones and selects those.
    It will then do the job, scrape the articles and output the text files.

    - Or should that process be 100% automatic?
    -----------------------------------------------------------------------------------------------------------------------------------------

    Different Option:


    Perhaps you are suggesting to simply feed the scraper with URLs you have found up front?

    So, you would ask the scraper to load a list of targets, and simply scrape those?
    Meaning if you feed it with 50 URLs - it will scrape those and deliver the articles as output.

  • Kaine (thebestindexer.com)
    edited April 2015
    @magically

    "Perhaps you are suggesting to simply feed the scraper with URLs you have found up front?

    So, you would ask the scraper to load a list of targets, and simply scrape those?
    Meaning if you feed it with 50 URLs - it will scrape those and deliver the articles as output."

    Yes, exactly - the software only visits the URLs and grabs the articles; it doesn't scrape for URLs itself.
  • magically
    Yep, I think I got your point now :D

    Most users know how to make a decent footprint, so they can also do their own searching in Google...

    So here is the scenario:

    1. Users make their own target list (using their own footprints and keywords), example: site:wordpress.com + diet
    2. When they have collected enough information - they make the list.
    3. They load their list into the program.
    4. The program extracts all articles and writes out text files.

    In other words:

    The program must have a feature to do the following:

    - Load targets (references to articles, of course)
    - Extract the article for each loaded URL
    - Write the article to file, for each URL - without source information and copyright info.

    Is that correct?
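    Spelled out as code, that flow is roughly the following - a simplified sketch, where extractArticle() stands in for the real extraction code and the file names are made up:

        import java.io.IOException;
        import java.nio.charset.StandardCharsets;
        import java.nio.file.*;
        import java.util.List;

        public class ArticlePipeline {

            // Placeholder for the actual article extraction (see the jsoup sketch above).
            static String extractArticle(String url) {
                return "article text from " + url;
            }

            public static void main(String[] args) throws IOException {
                // Load the user's target list (one URL per line).
                List<String> targets = Files.readAllLines(
                        Paths.get("targets.txt"), StandardCharsets.UTF_8);

                // Extract each article and write one text file per URL.
                int n = 1;
                for (String url : targets) {
                    String article = extractArticle(url.trim());
                    Files.write(Paths.get("article" + n++ + ".txt"),
                            article.getBytes(StandardCharsets.UTF_8));
                }
            }
        }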
  • magically
    edited April 2015
    Update - Work in progress for next release:

    - Implementation of 'SPLIT FILE' in the GUI (see previous post above, code is ready)

    - Prepared code for 'REMOVE DUPLICATE DOMAINS' (will be implemented in the GUI)

    Obviously GSA SER does not calculate it correctly...:D

    Scraping Tool-Box removes all junk files (.pdf, .xml, .chm etc.) + removes duplicate domains in this new algorithm.

    image

    That will give users an opportunity to keep only UNIQUE URLS or remove duplicate domains.

    Comparison of cases:

    The source file contained 4,615,209 URLs.

    Remaining targets left: 724,826 (Scraping Tool-Box)
    Remaining targets left: 713,554 (GSA SER)

    How come GSA SER ends up with fewer results, considering it doesn't remove junk files during dedupe?
    @Sven Could it be a bug:))
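    For reference, the dedupe step described above amounts to something like this simplified sketch (the junk-extension list and file names are illustrative, not the tool's full logic):

        import java.io.IOException;
        import java.net.URI;
        import java.nio.charset.StandardCharsets;
        import java.nio.file.*;
        import java.util.*;

        public class DomainDedupe {

            // Extensions treated as junk - an illustrative subset.
            static final Set<String> JUNK = new HashSet<>(Arrays.asList(".pdf", ".xml", ".chm"));

            static boolean isJunk(String url) {
                String lower = url.toLowerCase();
                for (String ext : JUNK) {
                    if (lower.endsWith(ext)) return true;
                }
                return false;
            }

            public static void main(String[] args) throws IOException {
                Set<String> seenDomains = new HashSet<>();
                List<String> kept = new ArrayList<>();

                for (String line : Files.readAllLines(
                        Paths.get("scraped-urls.txt"), StandardCharsets.UTF_8)) {
                    String url = line.trim();
                    if (url.isEmpty() || isJunk(url)) continue;  // drop junk file types first
                    String host;
                    try {
                        host = URI.create(url).getHost();        // the domain part of the URL
                    } catch (IllegalArgumentException e) {
                        continue;                                // skip malformed lines
                    }
                    if (host != null && seenDomains.add(host)) { // first URL per domain wins
                        kept.add(url);
                    }
                }
                Files.write(Paths.get("unique-domains.txt"), kept, StandardCharsets.UTF_8);
            }
        }

    Details like how the domain is parsed (e.g. whether subdomains count as separate domains) are exactly the kind of thing that can explain the count difference against GSA SER.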


    - Preparation of Article Scraper and Implementation
  • magically
    @Kaine

    Here is a very basic prototype of the article extractor:

    image

    The program was loaded with a target:

    The site was visited, and during the visit the article was extracted.
    Finally, it was printed out (to the console, for demonstration).

    I hope this is something like what you had in mind?
  • Kaine (thebestindexer.com)
    edited April 2015
    @magically

    Yes, that's it :) Do you think it's possible to load multiple pages at the same time and scrape them afterwards? (So there's no waiting on loading time.)
  • magically
    @Kaine

    Great to hear:D

    Well, as you will feed the program with a 'known list of targets' - there is no need to open any browser during the process.

    It will work similarly to the 'URL Keyword Scraper' - using multiple threads to extract the articles.

    That will speed up the process considerably.

    During the weekend I plan to start doing some more testing and coding of this feature, and as usual I will update the thread during the implementation with test results etc.
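    The threading idea is the standard fixed-pool pattern - a simplified sketch (fetchArticle() stands in for the real extraction code, and the URLs and pool size are made up):

        import java.util.ArrayList;
        import java.util.Arrays;
        import java.util.List;
        import java.util.concurrent.*;

        public class ParallelExtractor {

            // Stand-in for the jsoup-style extraction sketched earlier in the thread.
            static String fetchArticle(String url) {
                return "article text from " + url;
            }

            public static void main(String[] args) throws Exception {
                List<String> targets = Arrays.asList(
                        "http://site1.example/post", "http://site2.example/post");

                ExecutorService pool = Executors.newFixedThreadPool(10); // 10 downloads in parallel
                List<Future<String>> pending = new ArrayList<>();
                for (String url : targets) {
                    pending.add(pool.submit(() -> fetchArticle(url))); // queue every page at once
                }
                for (Future<String> result : pending) {
                    System.out.println(result.get()); // block only until that page is done
                }
                pool.shutdown();
            }
        }

    Because all downloads are in flight at once, one slow page no longer holds up the rest - which answers the question above about not waiting on loading time.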
  • Thanks to @Kaine for pointing me at this thread, I hadn't seen it before. 

    @magically good work dude, this looks like a very handy tool.
  • magically
    @JudderMan
    Many thanks for your kind words, really appreciated:)
    - And indeed thanks to @Kaine as well for support, ideas and feedback.
  • magically
    - Work in progress:

    Early preview of upcoming new feature - Split File

    image

    Not completed yet, still some heavy coding left to do....

    Once this feature is fully implemented, the work of the article scraper will be initiated.
  • Can't wait to get this software. SHOW YOUR MAGIC...  :((
  • magically
    edited April 2015
    - Completed - Split File (Included in next release...)

    image

    Ability to select various units:

    image

    Selection of target file that needs to be split:

    image

    Calculation of the file size is done 'on the fly'....

    image

    Process Initiated:

    image

    Task Completed:

    image

    Result:

    image

    Moving on to the next feature - I plan to start on it during the weekend (if time allows me to do it:D):
    - Work in Progress: Article Extracter
    Stay Tuned;)
  • Kaine (thebestindexer.com)
    It's the best feature for me ^^
  • magically
    @Kaine
    Indeed buddy;)

    Very early GUI mock-up (it can still change a lot):

    image

    Will see if I can get some time during the weekend to write the code and enhance the GUI here...
    Stay tuned for progress and updates during the weekend and the upcoming week.
  • magically
    - Added Detection of File Encoding Type (Under Various Tools Tab)
    It will detect which encoding a text-file is using 'on the fly' - really blasting fast!

    It can detect the following encoding types:
    Chinese:
    ISO-2022-CN
    BIG5
    EUC-TW
    GB18030

    Cyrillic:
    ISO-8859-5
    KOI8-R
    WINDOWS-1251
    MACCYRILLIC
    IBM866
    IBM855

    Greek:
    ISO-8859-7
    WINDOWS-1253

    Hebrew:
    ISO-8859-8
    WINDOWS-1255

    Japanese:
    ISO-2022-JP
    SHIFT_JIS
    EUC-JP

    Korean:
    ISO-2022-KR
    EUC-KR

    Unicode:
    UTF-8
    UTF-16BE / UTF-16LE

    Others:
    WINDOWS-1252

    image

    image

    image
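    For the technically curious: the encoding list above matches what Mozilla's universal charset detector supports, so the sniffing can be done with the juniversalchardet library (an assumption about the exact library; the file name is made up). A minimal sketch:

        import java.io.FileInputStream;
        import java.io.IOException;
        import org.mozilla.universalchardet.UniversalDetector;

        public class EncodingSniffer {

            // Feed the file to the detector until it is confident, then ask for the name.
            static String detect(String path) throws IOException {
                UniversalDetector detector = new UniversalDetector(null);
                byte[] buf = new byte[4096];
                try (FileInputStream in = new FileInputStream(path)) {
                    int n;
                    while ((n = in.read(buf)) > 0 && !detector.isDone()) {
                        detector.handleData(buf, 0, n);
                    }
                }
                detector.dataEnd();
                return detector.getDetectedCharset(); // e.g. "UTF-8", or null if unsure
            }

            public static void main(String[] args) throws IOException {
                System.out.println(detect("keywords.txt"));
            }
        }

    Detection stops as soon as the detector is confident, which is why it feels 'blasting fast' even on big files.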

  • magically
    Update - Almost completed the Article Extractor

    A small demo:

    1. We find 3 random targets using this footprint: site:wordpress.com + skincare

    In this case, the following were picked:
    Let's hit start and see what happens...

    image

    Now we look at the destination folder:

    image
    Indeed, 3 articles have been extracted and generated:D

    Sample from article2:
    image


    To Do before release of Scraping Tool-Box 1.2

    - Minor adjustments in the GUI
    - Implementation of other minor stuff
    - Compile the program
    - Launch;)

    Expected timeframe for version 1.2:
    5-7 days

  • magically
    - Added Korean Language Support for Keyword Generator

    image

    As I can't read Korean - here is a translation:
    image


  • magically
    - Adjusted logfile:

    image
  • magically
    - Adjusted the 'Article Extractor' GUI even further:

    image

    - Please see the previous entry - Article Scraper completed
    Just needs very small fixes - and it's done;)
  • magically
    - Fixed formatting issues in 'URL Keyword Scraper'

    image

    Still to do, before release of version 1.2:
    - Minor adjustments & Enhancements
  • Kaine (thebestindexer.com)
    edited April 2015

    And does it clean the copyright? It would also be good if it could delete the article's posting date, along with the possibility to change URLs inside the article.
  • magically
    edited April 2015
    @Kaine
    hehehe:P

    Well, that will be added in v. 1.21

    Simply because I need some feedback on how it works on a large number of targets..
    And there is one more thing to consider too - before adding this last feature to the article extractor.

    Performance - depending on the number of targets, removing different things before writing the text files could take some time. However, it can be done;)


    I just need to see how it works for you guys in 'real life' first, before adding advanced 'tweaks':D

    So, I suggest finishing up the remaining stuff and simply releasing v. 1.2 for you guys to try, then we take it from there..

    PS:
    If there should be a URL present in some of the articles - it's not complicated to generate a random URL as a replacement;)

    Pick one randomly ----> now replace the existing one with the random one... (Easy to implement later)


    PPS:

    I also hope to see more guys interested here - after all, we are all in the same boat, so why not help each other;)
  • magically
    - Prepared code to handle replacement of existing URLs in scraped text, before making the final text files.

    Here is a little demo (please note - it will not be implemented before version 1.21)

    Scraped demo text:

    When we entered our core range of Aurelia Probiotic Skincare products in to the<p><a href='http://www.aureliaskincare.com/aurelia-tv/'><b>example</b></a> link.</p>Bible to be tested only a few weeks after our launch in January 2013, we could only dream of seeing one of our products in the final published book.

    image

    The demo shows: the existing URL is replaced with "http://www.SomeUrl.com/"

    Question: Will it work on any URL?
    Answer: Most likely not - however, it will cover and handle quite a lot;)
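    The replacement itself can be as simple as one regular expression - a simplified sketch (the pattern and names are my illustration, not the final code):

        import java.util.regex.Matcher;
        import java.util.regex.Pattern;

        public class UrlReplacer {

            // Catches common http(s) links; deliberately simple, so unusual URL
            // forms (bare domains, odd schemes) will slip through - which is why
            // the honest answer above is "most likely not on any URL".
            static final Pattern URL = Pattern.compile("https?://[^\\s\"'<>]+");

            static String replaceUrls(String text, String replacement) {
                return URL.matcher(text).replaceAll(Matcher.quoteReplacement(replacement));
            }

            public static void main(String[] args) {
                String scraped = "see <a href='http://www.aureliaskincare.com/aurelia-tv/'><b>example</b></a> link";
                System.out.println(replaceUrls(scraped, "http://www.SomeUrl.com/"));
                // -> see <a href='http://www.SomeUrl.com/'><b>example</b></a> link
            }
        }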