GSA PI saving files named as unknown

AsimNawaz · September 2018

I have imported a list into GSA PI to remove the unknown ones. But there are two types of unknown files in the destination folder. What is the difference between them? One is simply "unknown", and the other is "unknown" with a platform type added to the file name: http://prntscr.com/kvt08h

Also, if a link is dead or returns a 404, how will the identification work?

I want to remove the dead ones from the list too.

@Sven can you help me with this? I just bought the tool yesterday and am still trying to figure it out.

s4nt0s · September 2018

Do you have any other project options selected? I see you're using both the extended engines and normal SER engines. Keep in mind if you're using this just to find engines supported by SER, the extended engines aren't necessary.

The dead or 404 links will go to the unrecognized file/folder.

AsimNawaz · September 2018

@s4nt0s
I already have an identified list in GSA SER, but it's now giving very low verification compared to what it used to give. So what I am doing is taking the files from the identified folder and getting them identified again. But I'm facing two problems here. One is that I separated the engine files into groups: all article files from the site list, all comment files, and so on. So I created a project in PI named "articles", input all the identified article files from GSA SER, and ran the project. Technically, PI should identify only article platforms and unknowns. But what's happening is that it's identifying all the other engines too from the article files.

s4nt0s · September 2018

@asimnawaz - Some URLs can match multiple engines. If you only select the article category in Pi, then run those article URLs through, it will sort them in the article categories. Article sites can also end up in general blogs, fake user agents, pingback, etc. It seems like you have all engines selected including the extended engines. With extended engines selected you will end up having even more categories.

Try only selecting SER Engines > Article. Then import all the identified articles and run it.
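
To illustrate why one URL can land in several categories, here is a rough Python sketch. The signatures and engine names in it are made up for the example and are not Pi's real engine definitions; it only shows the general idea that a page is checked against every enabled engine, so limiting the enabled set limits the output categories.

```python
import re

# Hypothetical signatures for illustration only.
# These are NOT Pi's actual engine definitions.
ENGINE_SIGNATURES = {
    "Article": [re.compile(r"submit (an|your) article", re.I)],
    "General Blogs": [re.compile(r"leave a (reply|comment)", re.I)],
    "Pingback": [re.compile(r'rel="pingback"', re.I)],
}


def match_engines(html, enabled=None):
    """Return every enabled engine whose signature appears in the page."""
    names = enabled if enabled is not None else ENGINE_SIGNATURES.keys()
    return [
        name
        for name in names
        if any(pattern.search(html) for pattern in ENGINE_SIGNATURES[name])
    ]


# A made-up page that carries markers for three engines at once.
page = (
    '<link rel="pingback" href="/xmlrpc.php">'
    "<a>Submit your article</a><h3>Leave a Reply</h3>"
)

print(match_engines(page))                       # all engines enabled
print(match_engines(page, enabled=["Article"]))  # only the article engine
```

With everything enabled, the same page is reported under all three made-up engines; with only "Article" enabled, it is reported once.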

AsimNawaz · September 2018

@s4nt0s
Thanks for clearing that.
I scraped more than 43 million links with Scrapebox last month. What would be the fastest way to get them all identified?
Secondly, if I need to remove only the dead or unidentified ones, what would be the best practice in PI?

s4nt0s · September 2018

1) You would just select the SER Engines and not the extended engines, because extended engines use regex for identification and eat up more resources. I would probably leave all project options default and increase thread count. You don't want to go too high with the threads, but bump it up some to see how your system handles it. If you uncheck the "limit bandwidth" option on the main UI, it will go faster but it will use a lot more CPU, so be careful with that.

2) There really is no special way to do that. It would be a matter of running the list through Pi as usual. The dead ones/unidentified would go to the unrecognized.
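
If you want to pull the dead links out separately, outside of Pi, something like the Python sketch below could be a starting point. It is not a Pi feature. The function names are made up for the example, and treating 404, 410, and unreachable hosts as "dead" is an assumption you may want to change.

```python
import http.client
import urllib.error
import urllib.request


def fetch_status(url, timeout=10):
    """Return the HTTP status code, or None if the URL cannot be reached."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 404.
        return err.code
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        # DNS failure, refused connection, timeout, malformed URL, etc.
        return None


def split_dead(urls, get_status=fetch_status):
    """Split URLs into (alive, dead). Dead = unreachable, 404, or 410 (assumption)."""
    alive, dead = [], []
    for url in urls:
        status = get_status(url)
        if status is None or status in (404, 410):
            dead.append(url)
        else:
            alive.append(url)
    return alive, dead
```

Example use would be `alive, dead = split_dead(my_urls)` and then importing only `alive`. This checks one URL at a time, so it would be slow on a very large list, and some servers reject HEAD requests, which this sketch counts as alive.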

AsimNawaz · September 2018

@s4nt0s
I recently added a few projects in PI and set 500 threads for one of them. Later it started behaving weirdly: http://prntscr.com/kxt6hj I closed PI and started it again, created new projects, and the same thing happened. Whenever this happens, the thread count I set for the projects changes to a random number, and a few of the selected engines get unchecked. What could be the problem?

s4nt0s · September 2018

@AsimNawaz - What happened when it was behaving weird? Any pop ups?

I need to try and replicate it on my end.

AsimNawaz · September 2018

@s4nt0s
The project is shown as working, but nothing is happening: http://prntscr.com/kxx184 http://prntscr.com/kxx1eq
Also, the buttons are not clickable and the projects are blinking.

s4nt0s · September 2018

@AsimNawaz - Wow, it looks like this is an unfortunate side effect of the recent DDoS attack that must have triggered something with the licensing.

I'm trying to get it figured out.

I thought a fresh reinstall worked, but it looks like there might be an issue with the threads. We might need to push an update for it. It does seem to be sorting URLs, but some things still seem to be off.

This is very strange. I'm looking into it and will try to get it figured out asap.

Sorry for the inconvenience.

AsimNawaz · September 2018

@s4nt0s
Yes, it was working fine 3 days ago, but I have been facing this for the last 3 days. The project runs fine for a few minutes, and after that everything goes blank. Also, sometimes the rate crosses 40k+ URLs/min, and of course all URLs end up unrecognized in that case. I tried multiple projects and thought that might be the issue, so I created a single one. The same errors continued. I then lowered the threads, restarted, and tried every tweak I could, but after all that I had to post here to see if there is another solution.
