Imported scraped targets causing high CPU
rastarr
Thailand
hey @Sven
I've been doing some scraping of target URLs with Scrapebox.
Now the first run produced over 7 million so I exported them in 100,000 target URL batch files.
I'm unsure if there's anything else to do before importing them, but when I imported 200,000 URLs across 88 projects, the CPU hit the roof. Even with the scheduler running 30 projects every 30 minutes, the CPU stays at over 90%.
Any ideas what might be the cause of this? (P.S. I sent you a bug report with this thread URL as the description, in case that helps)
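The pre-processing step described above (turning a 7-million-URL scrape into 100,000-line batch files) can be sketched in a few lines. This is just an illustration of de-duplicating and batching before import, not anything Scrapebox or GSA SER does internally; the batch size and file naming are assumptions.

```python
# Sketch: de-duplicate a large scraped URL list and split it into
# fixed-size batches before importing them into projects.
def unique_batches(urls, batch_size=100_000):
    """Yield lists of at most batch_size unique, non-empty URLs,
    preserving the order in which they were first seen."""
    seen, batch = set(), []
    for u in urls:
        u = u.strip()
        if not u or u in seen:
            continue  # skip blanks and duplicates
        seen.add(u)
        batch.append(u)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

# Usage (file names are illustrative): read the raw scrape, write one
# file per batch, e.g. batch_000.txt, batch_001.txt, ...
# with open("scraped_urls.txt") as fh:
#     for i, batch in enumerate(unique_batches(fh)):
#         with open(f"batch_{i:03d}.txt", "w") as out:
#             out.write("\n".join(batch) + "\n")
```

De-duplicating first also reduces wasted work in SER, since identical targets imported into many projects multiply the CPU load.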
Comments
Thanks heaps
Import URLs (holding site lists) << this will download each URL from the source you give it, extract every URL linked on it (plain-text URLs as well), and sort them in. A typical source is a pastebin.com URL where many URLs are listed one per line.
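To make the "holding site lists" behaviour concrete, here is a rough sketch of what fetching a source page and pulling out every URL (linked or plain text) amounts to. This is a simplification for illustration only, assuming a naive regex; it is not GSA SER's actual parser.

```python
# Sketch: fetch a page (e.g. a pastebin listing) and extract all URLs
# that appear in it, whether inside href attributes or as plain text.
import re
import urllib.request

# Naive pattern: anything starting http(s):// up to whitespace/quotes/brackets.
URL_RE = re.compile(r'https?://[^\s"\'<>]+')

def extract_urls(page_text: str) -> list:
    """Return unique URLs found in a page body, in first-seen order."""
    seen = set()
    return [u for u in URL_RE.findall(page_text)
            if not (u in seen or seen.add(u))]

def fetch_and_extract(url: str) -> list:
    """Download one source URL and extract the URLs listed on it."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return extract_urls(resp.read().decode("utf-8", errors="replace"))
```

Doing this for every imported line explains why large imports are expensive: each source URL costs a download plus parsing before anything is sorted into the site lists.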
[1] Import URLs (identify platform and sort in) - I'm presuming that identified platforms will end up in the site list marked 'Identified', right? It isn't going to sort them into 'Submitted' and/or 'Verified', correct?
[2] If that's true, then surely I must also enable the 'Identified' site list in my projects for any newly scraped targets to make it into 'Submitted' and/or 'Verified' - is that correct, or is there something else going on behind the scenes?
I'm just trying to figure out the optimal way to integrate newly found Scrapebox targets into my GSA SER projects. Any hints or insights gratefully appreciated, from you or others.
Can you also explain why people would use both Submitted and Verified lists in a project, instead of just Verified?
What's the gain, if any, in using Submitted, since those targets have already been tested by at least one other project?
Isn't it better to use only Verified, rather than waste time submitting more than once?
I'm baffled why people are using both.