Can I sort a list after taking all duplicate domains out?
googlealchemist
I'm not sure what the most efficient way to do this is while still getting the best results in the end...
If I have millions of scraped URLs to sort, it's taking tons of time because lots of them are on the same domain.
For the main contextual kinds of platforms, like articles, we only need to identify the domain as the right platform once. But for the other platforms, like comments, trackbacks, etc., where we can post links on many different URLs on the same domain, I want to identify all of those URL targets on the same domain.
So I guess my question is: will PI sort out the platform no matter what single URL from the domain I add to it, whether it's the main homepage, the specific registration page, or an inner page with a comment option? I.e., can I take my list to sort, put it into ScrapeBox, and use the remove duplicate DOMAIN function instead of just the remove duplicate URL function?
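To be clear about what I mean by domain-level vs URL-level dedupe, here's a rough Python sketch (not how ScrapeBox actually implements it, and it assumes folding www/non-www together is what you want):

```python
from urllib.parse import urlparse

def dedupe_by_domain(urls):
    """Keep the first URL seen for each domain, like a
    'remove duplicate domains' option, as opposed to
    'remove duplicate URLs' which only drops exact repeats."""
    seen = set()
    kept = []
    for url in urls:
        host = urlparse(url).netloc.lower()
        if host.startswith("www."):
            host = host[4:]  # assumption: treat www/non-www as the same domain
        if host and host not in seen:
            seen.add(host)
            kept.append(url)
    return kept

urls = [
    "http://example.com/blog/post-1",
    "https://www.example.com/contact",        # same domain, dropped
    "http://another-site.org/forum/thread-9",
]
print(dedupe_by_domain(urls))
# ['http://example.com/blog/post-1', 'http://another-site.org/forum/thread-9']
```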
Then I'd have all my main contextual targets sorted in a finished list. Then I can take all of the platforms that post to multiple inner URLs per domain and run a link extraction and/or site: search to get all of the inner URL targets from them?
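If I keep the raw scrape around, I'm thinking I could also pull the inner URLs back out of it for the flagged domains before re-crawling anything. A minimal sketch of that idea; the file names and the multi_post_domains set are just placeholders, not real tool output:

```python
from urllib.parse import urlparse

def host_of(url):
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host

# Hypothetical input: the domains that Platform Identifier flagged as
# comment/trackback-style platforms (the names here are made up).
multi_post_domains = {"example.com", "another-site.org"}

inner_targets = []
with open("raw_harvest.txt") as f:           # assumed file name
    for line in f:
        url = line.strip()
        if url and host_of(url) in multi_post_domains:
            inner_targets.append(url)

with open("inner_targets.txt", "w") as f:    # list to feed back in for posting
    f.write("\n".join(inner_targets))
```

That only recovers inner pages that were already harvested; for pages never scraped, the link extraction / site: search step would still be needed.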
Is there a better way to do this or am I way off?
Thanks
Comments
This post kind of overlaps with this one...
https://forum.gsa-online.de/discussion/28731/how-deep-does-gsa-spyder-a-site-looking-for-a-postable-url/
So would it be an accurate summary that I should take my raw scrapes and just use ScrapeBox to delete duplicate domains as-is, whether I harvested a homepage or an inner URL initially? Then import that into Platform Identifier, then import those results into GSA for posting with that "try to locate new url on "no engine match" (useful for some engines)" option checked?
@ksatul - It's in the settings. You can see it on the form here: https://docu.gsa-online.de/gsa_platform_identifier/settings