When importing URLs, should I remove duplicate URLs or domains?

Hello,

I use scrapebox for scraping URLs. When I import them into GSA, should I remove duplicate URLs or domains?

Is it a good idea to import them into global site list and then use that site list in projects?

Any other tips for importing?

Thanks..

Comments

  • Brandon Reputation Management Pro
    I import them directly into a project and then let that project run. You'll be left with a good Verified list. Be warned that you will likely see a success rate of 1% or less.

    Removing dupe URLs vs. dupe domains depends on what you're trying to accomplish.
  • What do you do with the verified list after? Do you use global lists?

    I want to build as many (useful) links as possible. Is it better to remove just dupe URLs or remove dupe domains and scrape more?
  • edited October 2014
    You don't need to remove dupe domains if you just do blog comments, trackbacks, guestbooks, etc.; you should have many different URLs in your project. You just don't want to hit the same domain with the same URL many times.

    You can remove dupe domains if you post to articles, wikis, etc.
  • ron SERLists.com
    @ptr22 - Always run that scraped file through Scrapebox and remove duplicate URLs before you dump them into SER. There is no reason to process the same URLs.

    You will find that your scrapes always have duplicate URLs. In short, save yourself some time - and save SER some unnecessary processing. 
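
    For reference, the dedupe step ron describes can also be done outside Scrapebox. A minimal Python sketch, assuming a plain text file with one scraped URL per line (the filenames are placeholders):

        # Drop exact duplicate URLs from a scraped list before importing it into SER.
        def dedupe_urls(in_path, out_path):
            seen = set()
            with open(in_path, encoding="utf-8", errors="ignore") as src, \
                 open(out_path, "w", encoding="utf-8") as dst:
                for line in src:
                    url = line.strip()
                    if url and url not in seen:
                        seen.add(url)
                        dst.write(url + "\n")

        dedupe_urls("scraped_urls.txt", "deduped_urls.txt")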
  • cheers guys, that's what I thought - remove dupe urls before importing and process dupe domains in SER
  • edited October 2014
    And what about the importing?

    Is it better to import them in advanced options (Import URLs - identify platform and sort in) and then use the Identified list in projects?
    Or import them directly into a project and use the Verified list?
  • ron SERLists.com
    @ptr22 - Always import directly. Never use the sort and identify feature. SER automatically does that anyway as it processes the scrapes.
  • @ron - What's the advantage of doing that and what kind of project should I import them into?
  • edited October 2014
    Much more interesting is to use GScraper and its powerful filters: make a list of forbidden words (diazepam, valium, viagra, weight, etc.) and apply it to the list. Apart from removing duplicates and the like, you can also set a maximum number of pages per domain, remove domains made up only of numbers, and so on. You have to purge the lists.

    And then import them as ron said, without identifying first.
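
    The purge described above (forbidden words, a cap on pages per domain, dropping all-numeric domains) is not tied to any one tool. A rough Python sketch, where the word list, the per-domain cap, and the numeric-domain check are illustrative assumptions rather than GScraper's actual filters:

        from collections import Counter
        from urllib.parse import urlparse

        FORBIDDEN = {"diazepam", "valium", "viagra", "weight"}  # example word list
        MAX_PAGES_PER_DOMAIN = 10  # assumed cap, tune to taste

        def purge(urls):
            per_domain = Counter()
            kept = []
            for url in dict.fromkeys(urls):  # drop exact duplicate URLs, keep order
                if any(word in url.lower() for word in FORBIDDEN):
                    continue  # forbidden word anywhere in the URL
                domain = urlparse(url).netloc.split(":")[0]
                label = domain.replace("www.", "", 1).split(".")[0]
                if label.isdigit():
                    continue  # domain made up only of numbers
                if per_domain[domain] >= MAX_PAGES_PER_DOMAIN:
                    continue  # too many pages from the same domain
                per_domain[domain] += 1
                kept.append(url)
            return kept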
  • ron SERLists.com
    edited October 2014

    @ptr22 - You never want to import scrapes into a real project. Set up a fake URL for a domain that does not exist (so you don't hurt anybody), or use Yahoo/Bing etc., as nothing can hurt them anyway. The purpose of processing scrapes is exactly that - to process scrapes - not to create links on real projects. Once you've sorted through all the scrapes, all of the links go to your verified folder. That is what you use for your real projects.

    There has been much discussion on this forum about not using sort & identify, mainly because it is far less efficient, slower, and consumes more memory than simply dumping the scrapes into the dummy projects I referenced. It's about speed and not wasting time or resources.

  • Thanks @ron, but when I import them into the dummy project, the URLs will be sorted and identified anyway, right? So how do I actually save resources doing this?
  • ron SERLists.com
    @ptr22 - You keep coming back to that, lol. Yes, when SER processes each target, it knows at that moment which platform and engine it is, and then deposits it into a special .txt file for that engine.

    Don't ask me to explain why SER is more efficient doing one thing versus another. The thing is that we the users know which is better for us. And that is what I was communicating to you.

    If you want to get into a discussion on coding, then you will have to speak with Sven on your own -  I want no part of that discussion haha.
    Given a choice between importing a site list into a project and reading sites from a site list as part of the Project Options, do you have a preference? And why?

    If Project A is reading sites from site list B, will Project A take note of additional sites that are added to site list B while Project A is running? This is my understanding of how SER works, but it seems that it doesn't always work that way in practice.
  • ron SERLists.com

    @tsgeric - Originally we used to like importing directly into the project because, for some reason back then, it was processing the targets lightning fast. Soon after that, the deal changed, and reading from the site list in project options was even faster. So we have gotten away from imports because it is just extra work with no added benefit.

    Yes, it will catch up not too long after you add or merge additional files. I think it is always best to stop SER when doing things like that and restart to get everything aligned properly in the system. I wouldn't do it while projects are running and expect it to know the file was changed. I think people push things too hard and want to be able to make changes while SER is running, and then expect good things to happen. I say make the changes and restart SER. Plus it frees up a bunch of memory cache, and SER gets faster anyway.

  • thanks ron!!
  • ok, I'm just trying to find out how this works:)

    what engines and settings do you use in the dummy project?

  • Contextual question to the gurus... say dupe URLs are removed, should I further trim the URLs before importing into SER?
  • 1linklist FREE TRIAL Linklists - VPM of 150+ - http://1linklist.com
    You should dedupe based on platform type. As a rule of thumb:

    Articles, forums, guestbooks, social networking sites, etc., need to be deduped by unique domain.

    Blog comments, trackbacks, picture posts, etc., need to be deduped by unique URL.

    Essentially, if it's a platform where you're going to make a link on an existing page (like a WordPress blog), then you want unique URLs.

    If it's a platform where you make a new page (like an article directory), you want unique domains.

    Hope this helps!

    -Jordan
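
    To make the rule of thumb above concrete, here is a short Python sketch of the two dedupe modes. The platform labels and grouping are assumptions for illustration, not SER's internal engine names:

        from urllib.parse import urlparse

        # Platforms where you make a new page: keep only one URL per domain.
        DEDUPE_BY_DOMAIN = {"article", "wiki", "forum", "guestbook", "social_network"}

        def dedupe_for_platform(urls, platform):
            seen = set()
            kept = []
            for url in urls:
                # Dedupe on the domain for "new page" platforms, on the full URL otherwise.
                key = urlparse(url).netloc if platform in DEDUPE_BY_DOMAIN else url
                if key not in seen:
                    seen.add(key)
                    kept.append(url)
            return kept

        urls = ["http://blog.example.com/post-1", "http://blog.example.com/post-2"]
        print(dedupe_for_platform(urls, "article"))       # one URL kept per domain
        print(dedupe_for_platform(urls, "blog_comment"))  # both URLs kept, deduped by URL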