GSA URLs Identification

Marky · June 2019

I am trying to use GSA SER to identify URLs from a list in a text file: http://prntscr.com/nzvr3p
The text file has a list of about 10 million URLs. GSA SER is not processing the whole file: identification stops after around 300k-500k URLs, and a dialog box pops up saying that identification of the text file has been completed. The popup shows around 200k identified and 300k unknown URLs. I have tried to process the file multiple times and even tried making the text file smaller, but SER never identifies the whole list.
@Sven Is there any way SER can process the whole file instead of just a few URLs?

Marky · June 2019

I even tried to run a text file with just 1 million URLs, and SER only did half of them: it skipped more than 600k URLs and reported the process as completed. Please have a look.

Image: http://forum.gsa-online.de/uploads/editor/cb/r3uf9m0z9p97.png

Sven · June 2019

Hard to say what this is without having a look at the file itself, but most likely it is either a formatting problem in the file or duplicate URLs in it.

When you de-dupe the file using the TOOLS menu, does it still have the same number of URLs?

Marky · June 2019

@Sven
The file has no duplicates, and I have also sent the files to you by PM. Please check.

Sven · June 2019

I had a look at one of the files you sent and it is indeed an issue in the file itself.

Have a look at the smaller file (desktop-46-vps-duped_split_2_split_1.txt) and search for: /2014/02/23/white-night-melbourne-2014/

The next line has the bugged entry.

Marky · June 2019

@Sven It's an entry with spaces and other characters in the URL, so SER should have skipped it and processed the rest of the URLs. Or is there a way to do this?

Sven · June 2019

No, it holds an EOF character (0x04), which tells a file processor to stop reading.

Marky · June 2019

@Sven
I have spent huge resources on scraping, and such URLs with spaces are very rare in the scraped file: http://" k z w b " = " Ã²fÂ¿~ÃÂÂ¨R"
Please let me know if there is any possible solution to process the scraped file.

Sven · June 2019

The space is not the issue. You probably can't see that EOF char in your viewer.
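
Since a text viewer may not display such a character, one way to locate it is to read the file as raw bytes and report the lines that contain control bytes. The following is only a rough sketch in Python, not part of GSA SER; the function name `find_control_bytes` and the choice of which bytes count as "bad" (everything below 0x20 except tab, LF and CR) are assumptions made for illustration.

```python
# Assumption: any byte below 0x20 other than tab, LF and CR is treated as
# suspicious. This includes 0x04, the character mentioned in this thread.
BAD_BYTES = frozenset(range(0x20)) - {0x09, 0x0A, 0x0D}


def find_control_bytes(path):
    """Return a list of (line_number, [byte values]) for lines holding control bytes.

    The file is opened in binary mode so that no byte is interpreted
    as an end-of-file marker or decoded as text.
    """
    hits = []
    with open(path, "rb") as fh:
        # Iterating a binary file yields one chunk per b"\n"-terminated line.
        for lineno, raw in enumerate(fh, start=1):
            found = sorted({b for b in raw if b in BAD_BYTES})
            if found:
                hits.append((lineno, found))
    return hits
```

Calling `find_control_bytes("list.txt")` on a scraped list (the file name is a placeholder) returns the line numbers to inspect, with the offending byte values as integers, so 0x04 shows up as `4`.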

Marky · June 2019

I have also bought GSA PI and am using it side by side. It runs fine and identifies the list too. The only issue with it is that it takes too much CPU, and I have also found SER much better at identification. But SER has this problem here, which PI does not. So that's why I have asked for help, if possible.

Sven · June 2019

Will try to fix it in the next update, though it's not really a bug but an issue in the file structure.
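
As a workaround for a damaged list, the file can be pre-cleaned before importing it. The sketch below is an assumption about one reasonable approach, not a description of how SER handles such files: it streams the list in binary mode and writes a copy without the lines that contain control bytes. The function name `drop_damaged_lines` and the set of bytes treated as damage are hypothetical.

```python
# Assumption: bytes below 0x20, except tab, LF and CR, mark a damaged line.
CONTROL = bytes(b for b in range(0x20) if b not in (0x09, 0x0A, 0x0D))


def drop_damaged_lines(src, dst):
    """Copy src to dst, skipping every line that contains a control byte.

    Returns (kept, dropped) line counts. Works line by line, so a
    multi-gigabyte list does not need to fit in memory.
    """
    kept = dropped = 0
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        for raw in fin:
            # translate(None, CONTROL) deletes control bytes; a length
            # change means the line contained at least one of them.
            if len(raw.translate(None, CONTROL)) != len(raw):
                dropped += 1
                continue
            fout.write(raw)
            kept += 1
    return kept, dropped
```

Dropping the whole line rather than stripping only the bad byte is a design choice here: an entry that carries binary garbage is unlikely to be a usable URL anyway.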

Marky · June 2019

@Sven Thanks. I really appreciate it. Will be waiting. Thanks again.

Marky · June 2019

@Sven Thanks a lot for the quick update. I just got the update, and the first thing that came to my mind was how quick you are. Thanks again. Will test the files again and let you know the results too.

Marky · June 2019

@Sven
Earlier I was using the free Scrapebox DupeRemove tool to remove dupes from files, and it was lightning fast even with files up to 5 GB. With the current update, I have deduped a 1 GB file and the difference in speed is prominent now.

Marky · June 2019

I am setting up the identification again and hope that it will now work for the whole file too. Let's hope for the best.

Sven · June 2019

Identification should work as well with damaged files... I added the same fix there.
