Can I sort a list after taking all duplicate domains out?
googlealchemist
I'm not sure what the most efficient way to do this is while still getting the best results in the end...
If I have millions of scraped URLs to sort, it's taking tons of time because lots of them are on the same domain.
For the main contextual kinds of platforms, like articles, we only need to identify the domain as the right platform once. But for the other platforms, like comments, trackbacks, etc., where we can post links on many different URLs on the same domain, I want to identify all of those URL targets on the same domain.
So I guess my question is: will PI sort out the platform no matter what single URL from the domain I add to it, whether it's the main homepage, the specific registration page, or an inner page with a comment option? I.e., can I take my list to sort, put it into ScrapeBox, and use the remove duplicate DOMAIN function instead of just the remove duplicate URL function?
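To be clear about what I mean by domain-level vs URL-level dedupe, here's a rough Python sketch (not how ScrapeBox actually implements it, and it assumes folding www/non-www together is what you want):

```python
from urllib.parse import urlparse

def dedupe_by_domain(urls):
    """Keep the first URL seen for each domain, like a
    'remove duplicate domains' option, as opposed to
    'remove duplicate URLs' which only drops exact repeats."""
    seen = set()
    kept = []
    for url in urls:
        host = urlparse(url).netloc.lower()
        if host.startswith("www."):
            host = host[4:]  # assumption: treat www/non-www as the same domain
        if host and host not in seen:
            seen.add(host)
            kept.append(url)
    return kept

urls = [
    "http://example.com/blog/post-1",
    "https://www.example.com/contact",        # same domain, dropped
    "http://another-site.org/forum/thread-9",
]
print(dedupe_by_domain(urls))
# ['http://example.com/blog/post-1', 'http://another-site.org/forum/thread-9']
```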
Then I'd have all my main contextual targets sorted in a finished list. Then I can take all of the platforms that post to multiple inner URLs per domain and run a link extraction and/or site: search to get all of the inner URL targets from them?
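If I keep the raw scrape around, I'm thinking I could also pull the inner URLs back out of it for the flagged domains before re-crawling anything. A minimal sketch of that idea; the file names and the multi_post_domains set are just placeholders, not real tool output:

```python
from urllib.parse import urlparse

def host_of(url):
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host

# Hypothetical input: the domains that Platform Identifier flagged as
# comment/trackback-style platforms (the names here are made up).
multi_post_domains = {"example.com", "another-site.org"}

inner_targets = []
with open("raw_harvest.txt") as f:           # assumed file name
    for line in f:
        url = line.strip()
        if url and host_of(url) in multi_post_domains:
            inner_targets.append(url)

with open("inner_targets.txt", "w") as f:    # list to feed back in for posting
    f.write("\n".join(inner_targets))
```

That only recovers inner pages that were already harvested; for pages never scraped, the link extraction / site: search step would still be needed.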
Is there a better way to do this or am I way off?
Thanks
Comments
This post kind of overlaps with this one...
https://forum.gsa-online.de/discussion/28731/how-deep-does-gsa-spyder-a-site-looking-for-a-postable-url/
So would it be an accurate summary that I should take my raw scrapes and just use ScrapeBox to delete duplicate domains as-is, whether I harvested a homepage or an inner URL initially? Then import that into Platform Identifier, then import those results into GSA for posting with that "try to locate new url on "no engine match" (useful for some engines)" option checked?
@ksatul - It's in the settings. You can see it on the form here: https://docu.gsa-online.de/gsa_platform_identifier/settings