Where do I find CAPTCHA sources?

DeeeeeeeeDeeeeeeee the Americas
Here's to a productive week, GSA pplz!! :) Need to know where to find example CAPTCHAs to feed the SDK in GSA's Captcha Breaker software package.

I already know I can find them online: scrape for something specific in the captcha-creating JavaScript (if it's done that way), or for something in the generated HTML (if a PHP script is making the test), and build a list of sites from the results.

Or, find a single site if it's a uniquely executed CAPTCHA, and just get many images generated by that same site on refresh.
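
For anyone who wants to script the single-site approach, here is a minimal Python sketch. It assumes the site serves a freshly generated image on every request to one image URL, which is not true of every CAPTCHA. The function names, the `.png` extension, and the example URL in the note below are all made up for illustration. Duplicate images are skipped by hashing the bytes.

```python
import hashlib
import time
import urllib.request
from pathlib import Path
from typing import Callable


def fetch_url(url: str) -> bytes:
    """Download one image; assumes each request returns a newly generated CAPTCHA."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.read()


def collect_samples(fetch: Callable[[], bytes], out_dir: str, wanted: int,
                    max_requests: int = 1000, delay: float = 1.0) -> int:
    """Save up to `wanted` distinct images into out_dir and return how many were saved."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    seen = set()
    for _ in range(max_requests):
        if len(seen) >= wanted:
            break
        data = fetch()
        digest = hashlib.sha256(data).hexdigest()
        if digest not in seen:  # skip images we already have
            seen.add(digest)
            # File extension is an assumption; adjust to what the site serves.
            (out / f"{digest[:16]}.png").write_bytes(data)
        time.sleep(delay)  # be polite to the site between refreshes
    return len(seen)
```

A call might look like `collect_samples(lambda: fetch_url("https://example.com/captcha.php"), "samples", wanted=200)`, with the placeholder URL swapped for the real image address. Passing the fetcher in as a function keeps the loop easy to try offline with a fake fetcher.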

But what about locally? I know I set GSA to save CAPTCHAs when I started. I also had CS in use for a bit. And now with CB, they're saving as well, I guess?


    We either do it by hand for a good-sized batch with plenty of variation (100 or so) or script it ourselves. CB will let you save them too, though, as you said. Just build a list of sites by hand and run them through repeatedly until you get a big enough sample size, I suppose.
  • DeeeeeeeeDeeeeeeee the Americas
    OK, sounds like a plan. Thanks for the ideas. :)

    Maybe I should also start with CAPTCHAs that aren't uber-challenging? :|

    So far, I've only tried improving a set that was already in CB which had a fairly low solve rate. I think my sample size was too small, but after the brute force with a few new filter profile sets, solve rate apparently was improved. So I saved it.

    Is this a REAL improvement, or is it limited b/c my sample size was too small?

    In other words, does the larger sample size help CB to solve more variations within that particular CAPTCHA and settings?
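
One rough way to judge whether a solve-rate change is real or just small-sample noise is to put a confidence interval around the measured rate. The sketch below uses the Wilson score interval, a standard statistics formula. It is not anything built into CB, and the counts in the usage lines are hypothetical.

```python
import math
from typing import Tuple


def wilson_interval(solved: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Approximate 95% Wilson score interval for a solve rate of solved/total."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = solved / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return centre - half, centre + half


# Hypothetical counts: the same 75% solve rate measured on 20 and on 200 samples.
for solved, total in ((15, 20), (150, 200)):
    low, high = wilson_interval(solved, total)
    print(f"{solved}/{total}: between {low:.1%} and {high:.1%}")
```

The interval for the small batch comes out much wider, so an apparent gain of a few points on a small set may not mean much. If the intervals before and after a change overlap heavily, more samples are needed before trusting the improvement.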
  • DeeeeeeeeDeeeeeeee the Americas
    OK...Browsing thru the CB folders...

    When trying to improve, do I use as many sources as possible, both solved and unsolved?
  • SvenSven
    yes, it has to be a natural mix, else you train it on one or the other type only.
  • DeeeeeeeeDeeeeeeee the Americas
    edited April 2018
    OK. I started with 199 samples of the captcha from a site I have in the SDK.

    Like I said, I think I want to start with something else. :p

    I've gotten NOWHERE. haha But I am also inspired to mess with that captcha generator a lot more, as well as DL and explore the other freeware php generators and see if there are any drastically different ideas I can add.

    I just want to mess with them some more.

    Yes; the goal is to make terribly maddening captchas. >:) lol

    If *I* can visually distinguish them, why can't a user do the same?? I am not trained in captcha deciphering like CB is.* :|  And I still get nearly 100% right.

    I admit, this one does take a few seconds of gazing upon to get right, but it's not like it's IMPOSSIBLE. I guess the goal for hard ones is to make them challenging, but not so challenging that a user will close their browser cursing me. lol Unless I really don't want logins and just want to seem to, or if logins are closed to all but the dedicated. Hmm..

    I CAN read them, just seems I am not proceeding in the right way to get CB to recognize the characters! It's not seeing ANYthing on OCR1,2, or 3. :(

    *edit: To be fair, I did choose the typefaces so I already know what the letterforms will look like. I wonder how another person would fare.

    @Sven, how do I check the CAPTCHAs against an external service? I tried, but CB didn't do anything, so I didn't do it right.
  • SvenSven
    what captcha is that?
    Basically you would do the following:

    • 1. load captchas
    • 2. make sure they all are correctly answered
    • 3. click DETECT
    • 4. click on the red label so it also assigns the chars that are probably missing
    • 5. go back to the filters and experiment a bit until the result looks good (remove noise, threshold...)
    • 6. right click on filters->auto optimize (only if it already has a solution for one captcha at least)
    • 7. click brute force and let it use current filters first and not use all sets (popup answers YES, NO)
    • 8. when done let it auto optimize it again

    That's basically the stuff I do for a new captcha.
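
To give a feel for what the "threshold" and "remove noise" filters in step 5 do conceptually, here is a small pure-Python sketch. It is not CB's implementation, and the function names and defaults are invented for illustration. It treats the image as rows of grayscale values, turns dark pixels into ink, then clears ink pixels that have no ink neighbours.

```python
from typing import List

Image = List[List[int]]  # rows of grayscale pixels: 0 = black, 255 = white


def threshold(img: Image, cutoff: int = 128) -> Image:
    """Binarise: pixels darker than cutoff become ink (1), the rest background (0)."""
    return [[1 if px < cutoff else 0 for px in row] for row in img]


def remove_specks(binary: Image, min_neighbours: int = 1) -> Image:
    """Clear ink pixels with fewer than min_neighbours ink pixels among their 8 neighbours."""
    height = len(binary)
    width = len(binary[0]) if height else 0
    out = [row[:] for row in binary]
    for y in range(height):
        for x in range(width):
            if not binary[y][x]:
                continue
            count = 0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    if dy == 0 and dx == 0:
                        continue
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < height and 0 <= nx < width:
                        count += binary[ny][nx]
            if count < min_neighbours:
                out[y][x] = 0  # isolated dot: treat it as noise
    return out
```

Calling `remove_specks(threshold(pixels))` on a small grid shows the effect: a lone dark pixel disappears while strokes of two or more connected pixels survive. Raising `min_neighbours` removes more noise but starts eating thin letter strokes, which is the same trade-off as experimenting with the filters by hand.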

  • DeeeeeeeeDeeeeeeee the Americas
    edited April 2018
    Thanks for the info, Sven.

    CB is actually trying to find answers now. :)

    (I am only unsure what you meant about clicking on the red label above.)

    The captcha I am working on solving with the 199 samples was written by Paul Drain and released under the GPL for use with OSCommerce and ZenCart. I modified it to make OCR harder (before getting into GSA), but I'm not saying in what ways, publicly! ;) Def not trying to give the world ideas in this regard. lol I'd rather find easy solves out there! :) The human solvers are getting costlier!

    I also worked on a captcha last nite in CB that a user uploaded to the board yesterday. The letterforms all turned out looking squiggly-ended and I had zero success. I didn't do the steps suggested above, tho....

    Both are kind of tough. lol

    I think I'm going to DL a captcha module and make significant mods to an *easy* captcha (new in SOME ways, but still VERY easy), and then solve a puzzle on my beginner level. It'd still be useful b/c it'll be a mod, so somehow different than just re-solving one that already has a high solve rate, and the new captcha can be used "olws" in the future.

    (like IRL but "on live web sites"? lol)   :)

    I'm sure you know some effective methods better than I do for keeping them difficult. For all I know, also, you could destroy the one I made (that "seems" difficult) in ten minutes! lol

  • DeeeeeeeeDeeeeeeee the Americas
    edited April 2018
    (I am only unsure what you meant about clicking on the red label above.)

    To anyone out there reading this and needing help: The red label includes all the known characters in the character set. So, CB gets it from the right answers, but you can add some chars, too.

    I guess, in part, this is why you need a large sample size: if each captcha shows only four characters out of a potential 70 or more, a small sample can easily miss a few characters entirely.
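
A quick back-of-the-envelope check of that idea, as a Python sketch. It assumes every character is drawn uniformly and independently from the set, which a real generator may not do, and the function names are made up. The figures of four characters per captcha and a 70-character set are just the ones mentioned above.

```python
def p_char_unseen(charset_size: int, chars_per_captcha: int, samples: int) -> float:
    """Probability that one particular character never appears in any sample."""
    draws = chars_per_captcha * samples
    return (1 - 1 / charset_size) ** draws


def expected_unseen(charset_size: int, chars_per_captcha: int, samples: int) -> float:
    """Expected number of characters in the set that never show up at all."""
    return charset_size * p_char_unseen(charset_size, chars_per_captcha, samples)


# Four characters per captcha, 70 possible characters, various sample sizes.
for samples in (20, 100, 199):
    print(samples, round(expected_unseen(70, 4, samples), 4))
```

Under those assumptions, 20 samples would leave roughly 22 of the 70 characters unseen on average, while 100 samples brings the expected number of unseen characters down to about 0.2. So a batch of 100 or more is usually enough to see the whole set at least once, though seeing a character once is not the same as having enough examples of it to train on.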

    Remember, the character set could be incomplete. I've eliminated some letters in captchas b/c some letters look alike and that really drives users nuts. I guess that again, a large sample size lets you know what you're dealing with.

    Captcha Breaker is working on the one uploaded yesterday. So far, the best attempt has yielded a 70.87% success rate, no wait, it's 75.73%... 77.67%!! :)

    Better than zero last nite with the squiggly lines.

    *SICK* SDK on CB, Sven! This actually WORKS, and works really well!!!!
