When importing URLs, should I remove duplicate URLs or domains?

Hello,

I use scrapebox for scraping URLs. When I import them into GSA, should I remove duplicate URLs or domains?

Is it a good idea to import them into global site list and then use that site list in projects?

Any other tips for importing?

Thanks..

Comments

  • Brandon Reputation Management Pro
    I import them directly into a project and then let that project run. You'll be left with a good Verified list. Be warned that you will likely see a success rate of 1% or less.

    Removing dupe URLs vs. dupe domains depends on what you're trying to accomplish.
  • What do you do with the verified list after? Do you use global lists?

    I want to build as many (useful) links as possible. Is it better to remove just dupe URLs or remove dupe domains and scrape more?
  • edited October 2014
    You don't need to remove dupe domains if you just do blog comments, trackbacks, guestbooks, etc.; you should have many different URLs in your project. You just don't want to hit the same domain with the same URL many times.

    You can remove dupe domains if you post to articles, wikis, etc.
  • ron SERLists.com
    @ptr22 - Always run that scraped file through Scrapebox and remove duplicate URLs before you dump them into SER. There is no reason to process the same URLs.

    You will find that your scrapes always have duplicate URLs. In short, save yourself some time - and save SER some unnecessary processing. 
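
    For reference, the dedupe step ron describes can also be done outside Scrapebox. A minimal Python sketch, assuming a plain text file with one scraped URL per line (the filenames are placeholders):

        # Drop exact duplicate URLs from a scraped list before importing it into SER.
        def dedupe_urls(in_path, out_path):
            seen = set()
            with open(in_path, encoding="utf-8", errors="ignore") as src, \
                 open(out_path, "w", encoding="utf-8") as dst:
                for line in src:
                    url = line.strip()
                    if url and url not in seen:
                        seen.add(url)
                        dst.write(url + "\n")

        dedupe_urls("scraped_urls.txt", "deduped_urls.txt")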
  • cheers guys, that's what I thought - remove dupe urls before importing and process dupe domains in SER
  • edited October 2014
    And what about the importing?

    Is it better to import them in advanced options (Import URLs - identify platform and sort in) and then use the Identified list in projects?
    Or import them directly into a project and use the Verified list?
  • ron SERLists.com
    @ptr22 - Always import directly. Never use the sort and identify feature. SER automatically does that anyway as it processes the scrapes.
  • @ron - What's the advantage of doing that and what kind of project should I import them into?
  • edited October 2014
    Much more interesting is to use GScraper and its powerful filters: make a list of forbidden words (diazepam, valium, viagra, weight, etc.) and apply it to the list. Apart from removing duplicates and the like, you can also set a maximum number of pages per domain, remove domains made up only of numbers, and so on. You have to purge the lists.

    And then import them as ron said, without identifying first.
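
    The purge described above (forbidden words, a cap on pages per domain, dropping all-numeric domains) is not tied to any one tool. A rough Python sketch, where the word list, the per-domain cap, and the numeric-domain check are illustrative assumptions rather than GScraper's actual filters:

        from collections import Counter
        from urllib.parse import urlparse

        FORBIDDEN = {"diazepam", "valium", "viagra", "weight"}  # example word list
        MAX_PAGES_PER_DOMAIN = 10  # assumed cap, tune to taste

        def purge(urls):
            per_domain = Counter()
            kept = []
            for url in dict.fromkeys(urls):  # drop exact duplicate URLs, keep order
                if any(word in url.lower() for word in FORBIDDEN):
                    continue  # forbidden word anywhere in the URL
                domain = urlparse(url).netloc.split(":")[0]
                label = domain.replace("www.", "", 1).split(".")[0]
                if label.isdigit():
                    continue  # domain made up only of numbers
                if per_domain[domain] >= MAX_PAGES_PER_DOMAIN:
                    continue  # too many pages from the same domain
                per_domain[domain] += 1
                kept.append(url)
            return kept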
  • ron SERLists.com
    edited October 2014

    @ptr22 - You never want to import scrapes into a real project. Set up a fake URL for a domain that does not exist (so you don't hurt anybody), or use Yahoo/Bing etc., as nothing can hurt them anyway. The purpose of processing scrapes is exactly that - to process scrapes - not to create links on real projects. Once you've sorted through all the scrapes, all of the links go to your verified folder. That is what you use for your real projects.

    There has been much discussion on this forum about not using sort & identify, mainly because it is far less efficient, slower, and consumes more memory than simply dumping the scrapes into the dummy projects I referenced. It's about speed and not wasting time or resources.

  • Thanks @ron, but when I import them into the dummy project, the URLs will be sorted and identified anyway, right? So how do I actually save resources doing this?
  • ron SERLists.com
    @ptr22 - You keep coming back to that, lol. Yes, when SER processes each target, it knows at that moment which platform and engine it is, and then deposits it into a special .txt file for that engine.

    Don't ask me to explain why SER is more efficient doing one thing versus another. The thing is that we the users know which is better for us. And that is what I was communicating to you.

    If you want to get into a discussion on coding, then you will have to speak with Sven on your own -  I want no part of that discussion haha.
    Given a choice between importing a site list into a project and reading sites from a site list as part of the Project Options, do you have a preference? And why?

    If Project A is reading sites from site list B, will Project A take note of additional sites that are added to site list B while Project A is running? This is my understanding of how SER works, but it seems that it doesn't always work that way in practice.
  • ron SERLists.com

    @tsgeric - Originally we used to like importing directly into the project because, for some reason back then, it was processing the targets lightning fast. Soon after that, the deal changed, and reading from the site list in project options was even faster. So we have gotten away from imports because it is just extra work with no added benefit.

    Yes, it will catch up not too long after you add or merge additional files. I think it is always best to stop SER when doing things like that and restart to get everything aligned properly in the system. I wouldn't do it while projects are running and expect it to know the file was changed. I think people push things too hard and want to be able to make changes while SER is running, and then expect good things to happen. I say make the changes and restart SER. Plus it frees up a bunch of memory cache, and SER gets faster anyway.

  • thanks ron!!
  • ok, I'm just trying to find out how this works:)

    what engines and settings do you use in the dummy project?

  • Contextual question to the gurus... say dupe URLs are removed, should I further trim the URLs before importing into SER?
  • 1linklist FREE TRIAL Linklists - VPM of 150+ - http://1linklist.com
    You should dedupe based on platform type. As a rule of thumb:

    Articles, forums, guestbooks, social networking sites, etc., need to be deduped by unique domain.

    Blog comments, trackbacks, picture posts, etc., need to be deduped by unique URL.

    Essentially, if it's a platform where you're going to make a link on an existing page (like a WordPress blog), then you want unique URLs.

    If it's a platform where you make a new page (like an article directory), you want unique domains.

    Hope this helps!

    -Jordan
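
    To make the rule of thumb above concrete, here is a short Python sketch of the two dedupe modes. The platform labels and grouping are assumptions for illustration, not SER's internal engine names:

        from urllib.parse import urlparse

        # Platforms where you make a new page: keep only one URL per domain.
        DEDUPE_BY_DOMAIN = {"article", "wiki", "forum", "guestbook", "social_network"}

        def dedupe_for_platform(urls, platform):
            seen = set()
            kept = []
            for url in urls:
                # Dedupe on the domain for "new page" platforms, on the full URL otherwise.
                key = urlparse(url).netloc if platform in DEDUPE_BY_DOMAIN else url
                if key not in seen:
                    seen.add(key)
                    kept.append(url)
            return kept

        urls = ["http://blog.example.com/post-1", "http://blog.example.com/post-2"]
        print(dedupe_for_platform(urls, "article"))       # one URL kept per domain
        print(dedupe_for_platform(urls, "blog_comment"))  # both URLs kept, deduped by URL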