Need help solving this captcha

edited August 2015 in Need Help
Hello,

I am new to CB and currently testing its capabilities. I have a service that uses two captcha versions. Is it possible to detect this in CB, or will I have to detect it beforehand? (Detection is rather easy, since the background is white in one version and gray in the other.)

Here's a zip file with some samples. V1 was unsolvable for CB; V2 got some solved. I am currently running brute force on V2 only to see how good the solve rate might get.


Thank you for your help in advance
Andre
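
The background check described in the opening post (white in one version, gray in the other) could be done before the image is handed to a solver. The following is only a rough sketch and not a CB feature: `classify_background`, the threshold of 230, and the plain 2D list of grayscale values are assumptions made for illustration.

```python
from statistics import mean

def classify_background(pixels, white_threshold=230):
    """Guess the background from the brightness of the image border.

    pixels is a 2D list of grayscale values (0 = black, 255 = white).
    Only the outer border is sampled, assuming the text does not touch it.
    The threshold is an assumed value and would need tuning on real samples.
    """
    top, bottom = pixels[0], pixels[-1]
    left = [row[0] for row in pixels[1:-1]]
    right = [row[-1] for row in pixels[1:-1]]
    border = list(top) + list(bottom) + left + right
    return "white" if mean(border) >= white_threshold else "gray"

# Tiny synthetic images: one white with a dark pixel inside, one gray.
white_sample = [[255] * 6 for _ in range(4)]
white_sample[2][3] = 0
gray_sample = [[180] * 6 for _ in range(4)]

print(classify_background(white_sample))
print(classify_background(gray_sample))
```

Sampling only the border keeps the dark text from pulling the average down, which matters more on small images.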


Comments

  • SvenSven www.GSA-Online.de
    from what site are those captchas?
  • can't tell. not sure if it matters though? I got a larger sample set now, if that helps
  • SvenSven www.GSA-Online.de
    well, it doesn't matter, but it would have been good not to have it in the Unknown section.
  • does the section influence the algorithms used while brute forcing?
  • SvenSven www.GSA-Online.de
    not at all...just cosmetic ;)
  • any feedback? I got version 2 to 31% now, will get an even larger sample set to see if it stays at this rate.

    still no success with version 1. the merged characters are a pain: I can get the shapes just right, actually perfectly readable, but since there's no way to segment them, all OCR fails.

    if there were a filter like "if more than n dots connect along the x-axis, split them in half / remove the dots in the middle", that might help to separate them, but I doubt it would work properly for some characters like D, E, F, L, P, T, Z. I am pretty much out of ideas for v1.

    going to upload v2 once I have the larger test set ready
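
The "split in half when more than n dots connect on the x-axis" idea from the comment above can be sketched roughly as follows. This is not an existing CB filter; `column_runs`, `split_wide_runs`, and the 0/1 image representation are assumptions for illustration. As the comment already suspects, a blind halving like this will also cut through legitimately wide characters.

```python
def column_runs(image):
    """Return (start, end) column ranges (end exclusive) that contain ink.

    image is a 2D list of 0/1 values, where 1 means ink.
    """
    width = len(image[0])
    has_ink = [any(row[x] for row in image) for x in range(width)]
    runs, start = [], None
    for x, ink in enumerate(has_ink):
        if ink and start is None:
            start = x
        elif not ink and start is not None:
            runs.append((start, x))
            start = None
    if start is not None:
        runs.append((start, width))
    return runs

def split_wide_runs(runs, max_width):
    """Halve any run wider than max_width until every piece fits."""
    if max_width < 1:
        raise ValueError("max_width must be at least 1")
    result = []
    stack = list(reversed(runs))  # pop() then yields runs left to right
    while stack:
        start, end = stack.pop()
        if end - start > max_width:
            mid = (start + end) // 2
            stack.append((mid, end))
            stack.append((start, mid))
        else:
            result.append((start, end))
    return result

# One-row example: a 2-column blob and a 6-column blob of merged characters.
image = [[0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0]]
runs = column_runs(image)
print(runs)
print(split_wide_runs(runs, max_width=3))
```

The resulting column ranges would then be cropped and passed to OCR one by one; picking `max_width` from the widest single character in the sample set is one possible starting point.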
  • SvenSven www.GSA-Online.de

    v2 = 48%

    v1 = 18%

  • more sample data: https://www.dropbox.com/s/0f5yyepue6u9qxh/bigbig.zip?dl=0

    I have to solve a lot of captchas for v1 now. which is currently the best-known service for solving captchas, so I can add it and have them solved automatically?

    for v2 I got 585 samples, solve rate dropped to 24%, rerunning optimization now. I'll gather more samples if I find a way to autosolve them.

    btw. just purchased a cb license

    can you share your algorithms with me?
  • SvenSven www.GSA-Online.de
    Import this one
  • thank you, my results differ though:

    v2: Black on Gray: 18.01% for my big sample with 1327 images
    v2: my version: 25.92% for the same sample set
    v1: Gray on White: 1.54% for a sample of 649 images

    I will let some auto-improvement / brute forcing run over my v2 tonight.

  • SvenSven www.GSA-Online.de
    well, the results differ because of the different samples and sample counts. The stuff I created was based on the 100+ captchas; I didn't test against the new stuff.
  • v2: my version increased to 26.15% after optimization.

    can you give it another shot for v1 and v2? at the moment it doesn't look like I can solve v1 at all. will gather some more samples for both.
  • SvenSven www.GSA-Online.de
    this already takes way too long with all your massive samples. Are those samples a mix of good/bad solved captchas? Otherwise it is optimizing against the wrong sets.
  • I gathered those samples with a script directly from the site and solved them manually. currently that set contains the good matches I had with my algorithm as well as the bad matches.

    What do you think would make sense for the next steps?
  • edited August 2015
    made a mistake yesterday (it was too late at night): my test folder had unfilled captchas in it. here's the latest export:

  • SvenSven www.GSA-Online.de
    well this is really not getting better here unless you have a good filter idea to code.
  • edited August 2015
    is there something that might do character matching even if the characters are not separated? it looks like the characters are very similar and do not rotate. I had a version which merged the grayish and black strokes into big characters (still joined). a filter that works like stamps with masks could help here: wherever a mask overlaps by n% (say 90-95% overlap = match), it would extract that character. not sure if that was understandable; I could draft my idea in Photoshop, I guess.

    this probably could work well since characters do not rotate nor change in size that much
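
The "stamp with masks" idea from the comment above might look roughly like the sketch below. It is not the internal algorithm mentioned in this thread; `overlap_score`, `find_matches`, the toy templates, and the 0.9 threshold are assumptions for illustration, and the sketch only slides along the x-axis because the characters are said not to rotate or change size much.

```python
def overlap_score(image, template, x_offset):
    """Fraction of the template's ink pixels that are also ink in the image
    when the template is placed at x_offset (rows aligned at the top)."""
    ink = hits = 0
    for y, row in enumerate(template):
        for x, value in enumerate(row):
            if value:
                ink += 1
                hits += image[y][x_offset + x]
    return hits / ink if ink else 0.0

def find_matches(image, templates, threshold=0.9):
    """Slide each template along the x-axis and report (x, char, score)
    for every position whose overlap reaches the threshold."""
    width = len(image[0])
    matches = []
    for char, template in templates.items():
        t_width = len(template[0])
        for x in range(width - t_width + 1):
            score = overlap_score(image, template, x)
            if score >= threshold:
                matches.append((x, char, score))
    return sorted(matches)

# Hypothetical 3-row masks and an image where "L" and "I" are joined
# by a connecting pixel in the bottom row.
templates = {
    "I": [[1], [1], [1]],
    "L": [[1, 0], [1, 0], [1, 1]],
}
image = [
    [1, 0, 0, 1, 0],
    [1, 0, 0, 1, 0],
    [1, 1, 1, 1, 0],
]
for match in find_matches(image, templates):
    print(match)
```

In this toy case the "I" mask also fits inside the "L" at x = 0, so a real filter would still need a rule for choosing between overlapping hits, for example preferring the mask with more ink. Scoring only the mask's ink pixels also means a solid black region matches everything.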
  • SvenSven www.GSA-Online.de
    such an algo only exists internally, but it's way too slow to be useful. I didn't manage to make it fast, and it would also require some manual input to prepare each char for a database.
  • please define way too slow.

    I have a limited set of chars for this problem, and since I expect to solve many of these captchas, it's worth my time. what can I do to use this?
  • SvenSven www.GSA-Online.de
    when I did this algo for recaptcha, it took like 10 sec for each image.
  • that works for me. cb does multithreading?
  • SvenSven www.GSA-Online.de
    yes, but really...getting this algo to work still requires a lot of work, as you have to cut each char from the captchas into pieces so that I can make a db from them.
  • how much is much? if it's a few days it's ok..
  • SvenSven www.GSA-Online.de
    but the problem is, will it be worth it? I can not tell if the result will be good here.
  • everything > 20% will help here, got a few million captchas to solve
  • btw. in the SDK I would like to have a delete option "delete all with positive match" so I can filter for images I don't have solutions for.
  • SvenSven www.GSA-Online.de

    well, you have the option to delete captchas with a non-empty result. Deleting the ones with a correct result will not help you much, as any new filters you define will not be used for the already answered captchas.

    The next filter set is only used when no answer on previous filter sets was given.
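
The fallback behaviour described in the comment above (a later filter set only runs when the earlier ones gave no answer) can be pictured with the sketch below. It illustrates the stated rule only and is not CB's actual code; `solve_with_cascade` and the dictionary-backed stand-in solvers are made up for the example.

```python
def solve_with_cascade(captcha, filter_sets):
    """Try each filter set in order and return the first non-empty answer.

    filter_sets is a list of callables that take a captcha and return a
    string ("" when that set produces no answer).
    """
    for index, solve in enumerate(filter_sets):
        answer = solve(captcha)
        if answer:
            return index, answer
    return None, ""

# Stand-in solvers: the first set answers img1 (possibly wrongly),
# the second set knows img1 and img2.
first_set = lambda captcha: {"img1": "X7K2"}.get(captcha, "")
second_set = lambda captcha: {"img1": "X7KZ", "img2": "Q9"}.get(captcha, "")

print(solve_with_cascade("img1", [first_set, second_set]))
print(solve_with_cascade("img2", [first_set, second_set]))
print(solve_with_cascade("img3", [first_set, second_set]))
```

Because `img1` already gets an answer from the first set, the second set is never consulted for it, even if its answer would have been the correct one. Under this rule, a new filter set can only pick up captchas that every earlier set left empty.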

  • I might find a better filter on the ones that either don't have an answer at all or are answered incorrectly, right? right now it's not looking that good or bad: best match is 7% on the set that had no right answer or no result at all. but I see what you mean, thanks for the input. will try it again on the completely empty ones
  • is there any way to multithread the bruteforcing? it takes ages currently and the machine it's running on is not under load at all