GSA URL Identification

I am trying to use GSA SER to identify URLs from a list in a text file: http://prntscr.com/nzvr3p
The text file has a list of about 10 million URLs. GSA SER is not processing the whole file: identification stops after around 300k-500k URLs, and a dialogue box pops up saying that identification of the text file has been completed. The popup shows around 200k identified and 300k unknown URLs. I have tried to process the file multiple times and even tried making the text file smaller, but SER never identifies the whole list.
@Sven Is there any way SER can process the whole file instead of just a few URLs?

Comments

  • I even tried to run a text file with just 1 million URLs, and SER only did about half of them, skipped more than 600k URLs, and reported the process as completed. Please have a look.
  • Sven www.GSA-Online.de
    Hard to say what this is without having a look at the file itself, but most likely either the file has a formatting problem or it contains duplicate URLs.

    When you de-dupe the file using the TOOLS menu, does it still have the same number of URLs?
  • @Sven
    The file has no duplicates, and I have also sent the files to you by PM. Please check.
  • Sven www.GSA-Online.de
    I had a look at one of the files you sent and it is indeed an issue in the file itself.
    Have a look at the smaller file (desktop-46-vps-duped_split_2_split_1.txt) and search for: /2014/02/23/white-night-melbourne-2014/

    The next line has the bugged entry.
  • @Sven It's an entry with a space and other characters in the URL, so SER should have skipped it and processed the rest of the URLs. Or is there a way to do this?
  • Sven www.GSA-Online.de
    No, it contains an EOF character (0x04), which tells a file processor to stop reading.
  • @Sven
    I have spent huge resources on scraping, and such URLs with spaces are very rare in the scraped file: http://" k z w b " = " òf¿~Џ¨R"
    Please let me know if there is any possible solution to process the scraped file.
  • Sven www.GSA-Online.de
    The space is not the issue. You probably can't see that EOF char in your viewer.
  • I have also bought GSA PI and am using it side by side. It runs fine and identifies the list too. The only issue is that it takes too much CPU, and I have found SER to be much better at identification. But SER has this problem, which PI does not. That's why I have asked for help, if possible.
  • Sven www.GSA-Online.de
    Will try to fix it in the next update, though it's not really a bug but an issue in the file structure.
  • @Sven Thanks. I would really appreciate it. Will be waiting. Thanks again.
  • @Sven Thanks a lot for the quick update. I just got it, and the first thing that came to mind was how quick you are. Thanks again. Will test the files again and let you know the results.
  • @Sven
    Earlier I was using the free ScrapeBox dupe remove tool to remove dupes from files, and it was lightning fast even with files up to 5 GB. With the current update, I have deduped a 1 GB file and the difference in speed is noticeable now.
  • I am setting up the identification again and hope that it will now work for the whole file too. Let's hope for the best.
  • Sven www.GSA-Online.de
    Identification should work as well with damaged files... I added the same fix there.
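
For anyone hitting the same problem on a version without the fix, here is a minimal Python sketch of one way to pre-clean a scraped list before importing it. It is not part of GSA SER; the function name `clean_url_list` and the file names are made up for illustration. It reads the file in binary mode (so a stray control byte such as the 0x04 mentioned above does not stop the read), removes control bytes other than tab and line breaks, and reports which lines were affected so they can be inspected.

```python
import os
import tempfile

# Bytes to remove: every control byte below 0x20 except tab, LF and CR.
# This covers 0x04 (the byte named in the thread) and 0x1A as well.
CONTROL_BYTES = bytes(b for b in range(0x20) if b not in (0x09, 0x0A, 0x0D))


def clean_url_list(src_path, dst_path):
    """Copy src_path to dst_path without control bytes.

    Returns the 1-based numbers of the lines that contained such bytes.
    Hypothetical helper for illustration, not a GSA SER feature.
    """
    affected = []
    # Binary mode: the file is read byte for byte, one line at a time,
    # so large lists do not have to fit in memory.
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        for line_no, line in enumerate(src, start=1):
            cleaned = line.translate(None, CONTROL_BYTES)
            if cleaned != line:
                affected.append(line_no)
            dst.write(cleaned)
    return affected


if __name__ == "__main__":
    # Small demo with a temporary file; the second entry holds a 0x04 byte.
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "urls.txt")
        dst = os.path.join(tmp, "urls_clean.txt")
        with open(src, "wb") as f:
            f.write(b"http://example.com/a\n"
                    b"http://exa\x04mple.com/b\n"
                    b"http://example.com/c\n")
        print(clean_url_list(src, dst))  # prints [2]
```

The sketch only strips the offending bytes and keeps the rest of each line; if the damaged entries are junk anyway, the reported line numbers can be used to drop those lines entirely instead.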