already parsed - and other things

the_other_dude · May 2022

Hi,

Currently I am scraping URLs with ScrapeBox and custom tools. GSA PI monitors the folders where my raw scrape lists are stored, filters the URLs, and stores them in the default SER identified list.

Issue 1: My verified link list creation project is set to pull links automatically from the identified site list in the default identified folder, but SER does not pick up fresh links from it. Instead, once it runs out of links in the target list, I have to manually import target URLs from the identified site list before SER starts doing anything.

Issue 2: SER does not remove URLs from the identified list once it has parsed them the first time. If building those lists is enabled, is SER supposed to move a URL to the submitted, verified, or failed list once it has parsed the link and it has been submitted, verified, or failed?

Every time I import the URLs from the identified site list, SER spends most of its time parsing already parsed URLs, since they are not being moved to different folders. Is this normal? Do I need to change my settings?
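As a stopgap for the re-parsing described above, the raw lists could be de-duplicated before they are imported again. The following is only a minimal sketch: it assumes the URLs are available as plain lists of strings outside SER, and the names `normalize` and `fresh_targets` are hypothetical helpers, not part of SER, GSA PI, or ScrapeBox.

```python
def normalize(url):
    """Trim whitespace and a trailing slash so trivially different
    spellings of the same URL compare equal (a simplification)."""
    return url.strip().rstrip("/")


def fresh_targets(identified, processed):
    """Return URLs from `identified` that are not in `processed`,
    keeping the original order and dropping blanks and duplicates."""
    seen = {normalize(u) for u in processed}
    fresh = []
    for url in identified:
        key = normalize(url)
        if key and key not in seen:
            seen.add(key)  # also filters repeats within `identified`
            fresh.append(url.strip())
    return fresh
```

Calling `fresh_targets(identified_urls, already_parsed_urls)` would give the list to import, so only unseen targets are handed over.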

Thank you

Image: https://forum.gsa-online.de/uploads/editor/ws/h5zspetrospg.png

Sven · May 2022

ISSUE1: I see in the log that it uses your site lists (14/580 loaded), meaning 14 new targets to post to out of 580 loaded ones. However, I see no submission happening after that, and I don't know why. Maybe there is a further restriction behind it. But loading from the site lists works, at least.

ISSUE2: If a target was pulled from the identified site list, it is added to submitted, verified, or failed if that is the result later in the process. It is, of course, not added to identified again. I have just tested this on my end and it worked as expected. Maybe there is an access restriction on your end?

the_other_dude · May 2022

Should I run SER as admin?

Sven · May 2022

Usually it's not required, but it may be worth a try.

the_other_dude · May 2022

Sven said:

Usually it's not required, but it may be worth a try.

I think I figured this out. The app data folder (and all of its subfolders) was read-only. I just fixed that. Doh!

cherub · May 2022

Ouch! All those lost target urls!

the_other_dude · May 2022

cherub said:

Ouch! All those lost target urls!

Fortunately it was just one day. But yes, multiple instances of ScrapeBox scraping at 3500 threads each, plus link extraction, is a lot of lost data. All I can do is laugh about it.

the_other_dude · May 2022

I was wrong. The folders aren't write-protected. For some reason Windows shows a square inside the read-only checkbox, and I thought that meant the folders were read-only. Apparently the square means nothing; only a check mark does. I switched away from Windows many years ago.
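Since the checkbox in the folder properties is easy to misread, a more direct test is to try creating a file in each folder. Here is a minimal sketch of that idea; `can_write` and `unwritable_folders` are hypothetical helper names, and the path you pass in is whatever folder holds your site lists (not specified in this thread).

```python
import os
import tempfile


def can_write(folder):
    """Return True if a temporary file can actually be created in `folder`."""
    try:
        # The file is deleted automatically when the block exits.
        with tempfile.NamedTemporaryFile(dir=folder):
            pass
        return True
    except OSError:
        # Covers missing folders and permission errors alike.
        return False


def unwritable_folders(root):
    """Walk `root` and list every folder where the write test fails."""
    return [path for path, _dirs, _files in os.walk(root) if not can_write(path)]
```

Run `unwritable_folders(...)` on the top-level folder as the same Windows user that runs SER; an empty list means every folder that was reached accepted a write.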
