Could somebody please explain a few things from this screenshot?
Recently I scraped a few million URLs, set up 5 projects and imported around 2 million links into each. But when I look at the logs throughout the day, I see the same sites repeating over and over (see the screenshot below). I already deduped all of the lists and GSA only found around 1k dupes in each, but this stuff still shows up:
A few questions regarding the above:
1. At the beginning we have 2 brackets [ ], sometimes empty, sometimes a "-" sometimes a "+" inside. What does each of these stand for?
2. At first I thought that "0685/7874" means that currently, the project is trying to post to site #685 out of all those that were imported. Looking at the picture, that obviously doesn't make sense since there are like 4-5 different domains that all have "0685". So what do these 2 numbers stand for?
3. As mentioned above, the same sites you see in the screenshot show up over and over again...why is GSA trying to post to these over and over? I already deduped the list and disabled "continuously try to post to a site even if failed before", yet a few hours later the log is still full of this and I fear this is wasting a ton of time and resources trying to post to the same sites.
Hope @Sven or someone else can shed some light on this.
Comments
Unfortunately can't edit OP anymore but for some reason I enabled the sitelist "identified" (about 30 mins ago) and look what's there:
These are obviously not official engines (hence "unknown"), yet GSA seems to think they are (according to the log: "matches engine opera.com", lol)...though that still wouldn't explain why it tries to post over and over again to the same URL. Not sure, maybe it's still helpful in figuring out the issue.
Anyway, it's disabled, problem still there though.
Anybody else?
I don't get it, these aren't even officially integrated engines, so why is SER saying "matching engine netlog.com" when there is none? Almost every URL/domain of my scraped list is being identified as either opera or netlog...
My understanding is obviously limited so I'd really appreciate some help with this.
Since this started my LPM is down from 80 to 15...
Edit: I see netlog.com and opera.com are actual engines inside the web 2.0's ("where to submit") - I'll uncheck these for now hoping that it fixes it. However, why does it then say "unknown" for these engines if they are officially inside GSA? And is it normal that GSA thinks for hundreds of thousands of URLs that they could be opera/netlog?
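>1. At the beginning we have 2 brackets [ ], sometimes empty, sometimes a "-" sometimes a "+" inside. What does each of these stand for?
The brackets show the message type: [+] is a successful/positive message and [-] is a negative one. There are also [!] for attention and [ ] for neutral.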
>2. At first I thought that "0685/7874" means that currently, the project is trying to post to site #685 out of all those that were imported. Looking at the picture, that obviously doesn't make sense since there are like 4-5 different domains that all have "0685". So what do these 2 numbers stand for?
It means it is working on site 0685 out of 7874. The number repeats because the same URL is matching a couple of engines. I guess that's the problem you have as well.
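If you want to check how often this happens, here is a rough Python sketch that tallies the bracket markers and groups URLs by the site counter. It assumes you can copy the log lines into a list or text file. The line layout in the regex and the sample lines are made up for illustration, they are not taken from SER itself, so adjust the pattern to whatever your log really looks like.

```python
import re
from collections import Counter, defaultdict

# ASSUMPTION: hypothetical line layout, e.g.
#   "[-] 0685/7874 matches engine netlog.com - http://example.com/page"
# The real SER log may be laid out differently.
LINE_RE = re.compile(
    r"^\[(?P<mark>[+\-! ]?)\]\s+(?P<idx>\d+)/(?P<total>\d+)\s+(?P<rest>.*)$"
)
URL_RE = re.compile(r"https?://\S+")


def summarize(lines):
    """Count message markers and collect the URLs seen under each site counter."""
    marks = Counter()
    urls_by_index = defaultdict(set)
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # skip lines that do not fit the assumed layout
        # Treat empty brackets "[]" the same as the neutral "[ ]".
        marks[m.group("mark") or " "] += 1
        url = URL_RE.search(m.group("rest"))
        if url:
            urls_by_index[int(m.group("idx"))].add(url.group(0))
    return marks, urls_by_index


# Made-up sample lines, only to show the idea.
sample = [
    "[-] 0685/7874 no engine matches - http://a.example/page",
    "[ ] 0685/7874 matches engine netlog.com - http://b.example/page",
    "[+] 0686/7874 submission successful - http://c.example/page",
    "[!] 0686/7874 attention - http://c.example/page",
]

marks, urls_by_index = summarize(sample)
for idx, urls in sorted(urls_by_index.items()):
    print(f"{idx:04d}: {len(urls)} distinct URL(s)")
```

A counter value that shows up with several distinct URLs, or one URL that shows up many times, would at least tell you where the repeats come from before you start unchecking engines.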