
Can I sort a list after taking all duplicate domains out?

I'm not sure what the most efficient way to do this is while still getting the best results in the end...

If I have millions of scraped URLs to sort, it's taking tons of time because lots of them are on the same domain.

For the main contextual platforms (articles, etc.) we only need to identify the domain as the right platform once. But for the other platforms, like comments and trackbacks, where we can post links on many different URLs on the same domain, I want to identify all of those URL targets on the same domain.

So I guess my question is: will PI sort out the platform no matter which single URL from the domain I give it, whether that's the main homepage, the specific registration page, or an inner page with a comment option? I.e., can I take my list to sort, put it into ScrapeBox, and use the remove duplicate DOMAIN function instead of just the remove duplicate URL function?

Then I'd have all my main contextual targets sorted in a finished list. After that, can I take all of the platforms that post to multiple inner URLs per domain and run a link extraction and/or a site: search to get all of the inner URL targets from them?
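
To make it concrete, here's a rough Python sketch of the two list operations I mean: keeping one URL per domain for the first pass, and grouping every URL by domain so the inner targets can be pulled back later. This is just an illustration, not what ScrapeBox or PI does internally. The function names (`host_of`, `one_url_per_domain`, `group_by_domain`) are made up for the example, and "domain" is approximated by the hostname with a leading `www.` stripped, so subdomains count as separate domains.

```python
from urllib.parse import urlsplit


def host_of(url):
    """Return the lowercased hostname of a URL without a leading 'www.'.

    Assumption: hostname stands in for "domain"; no public-suffix handling.
    Returns an empty string if the URL cannot be parsed.
    """
    # Scraped lists often contain bare hosts; add a scheme so they parse.
    if "://" not in url:
        url = "http://" + url
    try:
        host = (urlsplit(url).hostname or "").lower()
    except ValueError:
        return ""
    return host[4:] if host.startswith("www.") else host


def one_url_per_domain(urls):
    """Keep only the first URL seen for each domain, in input order."""
    first_seen = {}
    for url in urls:
        host = host_of(url)
        if host and host not in first_seen:
            first_seen[host] = url
    return list(first_seen.values())


def group_by_domain(urls):
    """Map each domain to every URL in the list that belongs to it."""
    groups = {}
    for url in urls:
        host = host_of(url)
        if host:
            groups.setdefault(host, []).append(url)
    return groups


urls = [
    "http://www.example.com/blog/post-1",
    "http://example.com/register",
    "https://other.example.org/",
    "example.com/comments?p=2",
]
deduped = one_url_per_domain(urls)   # one representative URL per domain
by_domain = group_by_domain(urls)    # all inner URLs, keyed by domain
```

So `deduped` would be the list to run through the sorter once, and `by_domain` would be the lookup for getting all the URLs back for any domain that turns out to be a comment or trackback type platform.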

Is there a better way to do this or am I way off? 

Thanks
