When importing URLs, should I remove duplicate URLs or domains?
Hello,
I use scrapebox for scraping URLs. When I import them into GSA, should I remove duplicate URLs or domains?
Is it a good idea to import them into global site list and then use that site list in projects?
Any other tips for importing?
Thanks..
Comments
Removing dupe URLs vs. dupe domains depends on what you're trying to accomplish.
I want to build as many (useful) links as possible. Is it better to remove just dupe URLs or remove dupe domains and scrape more?
Is it better to import them in advanced options (Import URLs - identify platform and sort in) and then use the Identified list in projects?
Or import them directly into a project and use the Verified list?
@ptr22 - You never want to import scrapes into a real project. Set up a dummy project with a fake URL on a domain that does not exist (so you don't hurt anybody), or use Yahoo/Bing etc., since nothing can hurt them anyway. The purpose of these dummy projects is exactly that: to process scrapes, not to create links on real projects. Once you've sorted through all the scrapes, the working links end up in your verified folder. That is what you use for your real projects.
There has been much discussion on this forum about not using sort & identify, mainly because it is far less efficient, slower, and consumes more memory than simply dumping the scrapes into the dummy projects I mentioned. It's about speed and not wasting time or resources.
@tsgeric - Originally we liked importing directly into the project because, back then, it processed the targets lightning fast. Soon after, that changed, and reading from the site list in the project options became even faster. So we have moved away from direct imports: it's extra work with no added benefit.
Yes, it will catch up not long after you add or merge additional files. I think it's always best to stop SER when doing things like that and restart it so everything gets aligned properly in the system. I wouldn't do it while projects are running and expect SER to know the file was changed. People push things too hard, wanting to make changes while SER is running and then expecting good things to happen. I say make the changes, then restart SER. It also frees up a bunch of memory cache, so SER gets faster anyway.
What engines and settings do you use in the dummy project?
Articles, Forums, Guestbooks, Social Networking sites, etc. need to be deduped by unique domain.
Blog comments, trackbacks, picture posts, etc. need to be deduped by unique URL.
Essentially, if it's a platform where you're going to make a link on an existing page (like a WordPress blog), you want unique URLs.
If it's a platform where you make a new page (like an article directory), you want a unique domain.
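To make the distinction concrete, here is a minimal Python sketch of the two dedup rules. The sample URLs are made up, and both ScrapeBox and SER have their own built-in dedup tools; this just illustrates the logic:

```python
# Minimal sketch of the two dedup rules above. The sample URLs are
# hypothetical; in practice you'd read your scraped list from a file.
from urllib.parse import urlparse

def dedupe_urls(urls):
    """Keep one entry per exact URL (blog comments, trackbacks, etc.)."""
    seen = set()
    unique = []
    for url in urls:
        if url not in seen:
            seen.add(url)
            unique.append(url)
    return unique

def dedupe_domains(urls):
    """Keep one entry per domain (articles, forums, social networks, etc.)."""
    seen = set()
    unique = []
    for url in urls:
        domain = urlparse(url).netloc.lower()
        if domain not in seen:
            seen.add(domain)
            unique.append(url)
    return unique

if __name__ == "__main__":
    scraped = [
        "http://example.com/post-1#comment",
        "http://example.com/post-2#comment",
        "http://example.com/post-1#comment",  # exact duplicate
    ]
    print(dedupe_urls(scraped))     # drops only the exact duplicate
    print(dedupe_domains(scraped))  # keeps just one URL for example.com
```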
Hope this helps!
-Jordan