
Removing Duplicates from Sitelist - Feature

AlexR Cape Town
edited February 2013 in Feature Requests
For remove duplicate domains in the sitelist, can we have it so that the blog platforms are deselected by default?
For remove duplicate URLs in the sitelist, can we have it so that blog platforms are the ONLY platforms selected by default?

Comments

  • AlexR Cape Town
    I'd also like it, by default, to exclude the image comment platforms from delete duplicate domains,

    as well as to INCLUDE image comment platforms with delete duplicate URLs.

    @LeeG - for image comments, do you remove duplicate URLs or duplicate DOMAINS?
  • Why would you want a duplicate URL from any platform on any list?
  • LeeG Eating your first bourne
    edited February 2013

    I'm confused by that. Just nod and agree :D

    I don't do any guestbook or image comments, in all honesty

    With image comments, I personally found I got very few results

    Duplicate URLs just bulk out the site lists

    If I forget to kill them for a few days, they soon build up

    GG are you getting confused with duplicate domains?

    Something which I don't touch. You can have multiple engine types on domains, i.e. wikis, blogs and forums

    With blogs you might also hit a nice high PR page. Not all blog pages will have a uniform PR across all pages

  • AlexR Cape Town
    @LeeG

    1) Image & blog comments are posted on a page URL, so I'd like to keep these pages in my database. That's why I don't want "delete duplicate domains" for these platforms. I.e. you can have a PR 5 blog comment page on a site and a PR 1 blog comment page on the same site; using delete duplicate domains may remove the PR 5 page from the sitelist. That's why these platforms SHOULD automatically be excluded from remove duplicate DOMAINS. The same logic applies for image comments.
    2) Thus, to handle image and blog pages, you should use delete duplicate URLs, BUT this should have only blog & image platforms selected by default.
    3) I also think there should be an option so that when SER is not running at full speed, i.e. threads are down, it auto-clears the sitelists:
    a) it should remove dupe domains, and dupe URLs for blog & image platforms.
    b) it should recheck entries in the sitelists to ensure the platforms are still correct and the sites are live, i.e. ensure they are still working. I'd imagine after a year or two of sitelists, entries that were verified a year ago may well be dead by now, so you end up with a lot of dead entries in your sitelists.

    @sven - could we have a feature where, when SER slows down, it does the above? It's just that deduping crashes my system if I select all platforms, because the lists have got too big. It would be nice if it did this in the background in regular stages.
  • LeeG Eating your first bourne

    Just as I thought, you are more confused than normal. That's why I asked if you were talking about remove duplicate URLs or remove duplicate domains.

    You started off by asking about duplicate URLs and then turned it into the duplicate domain option.

    Two totally different things

     

    Remove duplicate domains > do not touch the remove button

     

    Remove duplicate URLs, the option above > do this on a regular basis

    This clears all the bulk. You can have 200 copies of the same URL listed in your site lists

    www.site1.com/blogpage1.html

    www.site1.com/blogpage1.html

    www.site1.com/blogpage1.html

    etc etc etc
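    To put that in code terms, here is a rough sketch of what remove duplicate URLs effectively does to a sitelist .txt file (not SER's actual code, just exact-line de-duplication in Python that keeps the first occurrence; the file name in the usage comment is only an example):

    # Rough sketch, not SER's code: remove duplicate URLs from one sitelist file,
    # keeping the first occurrence of each exact URL and preserving order.
    def dedupe_urls(path):
        seen = set()
        kept = []
        with open(path, encoding="utf-8", errors="ignore") as f:
            for line in f:
                url = line.strip()
                if url and url not in seen:
                    seen.add(url)
                    kept.append(url)
        with open(path, "w", encoding="utf-8") as f:
            f.write("\n".join(kept) + "\n")

    # e.g. dedupe_urls("sitelist_Blog Comment-Example.txt")  # hypothetical file name
    # would collapse the three identical www.site1.com/blogpage1.html lines above into one entry.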

  • AlexR Cape Town
    edited February 2013
    @LeeG - not sure I'm confused here, but maybe I'm explaining it badly.

    Remove Duplicate URLs - useful for blog & image comments or guestbooks, where the link is placed on a page. Like in your example below, it should remove the duplicates. That's why I am saying it should not be enabled for all platforms.

    www.site1.com/blogpage1.html

    www.site1.com/blogpage1.html

    www.site1.com/blogpage1.html


    But it should keep:

    www.site1.com/blogpage1.html

    www.site1.com/blogpage2.html

    www.site1.com/blogpage3.html


    Remove Duplicate Domains:

    www.site1.com/my-article-url1

    www.site1.com/my-article-url2

    www.site1.com/my-article-url3

    www.site1.com/my-article-url1000


    OR

    www.socnet1.com/profile1

    www.socnet1.com/profile2

    www.socnet1.com/profile5000


    For other platforms (wikis, articles, social networks), surely you'd only want to keep www.site1.com or www.socnet1.com as 1 entry rather than have 1000 unique site1 or socnet1 pages? Surely only remove dupe domains would sort this out?

    BUT BUT

    You do not want to use remove dupe DOMAINS on blogs, images or guestbooks where it's helpful to have the unique PAGES. 


    That's why I am asking for different default platform selections for each option.


    @sven - would it be possible to select by platform GROUP in the dedupe lists, rather than having to manually select or deselect each and every platform? This way we can quickly deselect image, blog and guestbook platforms, rather than having to tick and untick every box.
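    To make the idea concrete, here is a rough Python sketch of the kind of per-group defaults I mean (this is not SER functionality, just an illustration over the sitelist .txt files; it assumes files are named "sitelist_<Group>-<Engine>.txt" as in "sitelist_Article-Article Script.txt", and the page-level group names "Blog Comment", "Image Comment" and "Guestbook" are my assumption):

    import os
    from urllib.parse import urlparse

    # Groups where the individual page matters, so only exact duplicate URLs are removed.
    PAGE_LEVEL_GROUPS = {"Blog Comment", "Image Comment", "Guestbook"}  # assumed group names

    def host(url):
        if "://" not in url:
            url = "http://" + url
        return urlparse(url).netloc.lower()

    def dedupe_file(path, by_domain):
        seen, kept = set(), []
        with open(path, encoding="utf-8", errors="ignore") as f:
            for line in f:
                url = line.strip()
                if not url:
                    continue
                key = host(url) if by_domain else url
                if key not in seen:
                    seen.add(key)
                    kept.append(url)
        with open(path, "w", encoding="utf-8") as f:
            f.write("\n".join(kept) + "\n")

    def dedupe_sitelist(folder):
        for name in os.listdir(folder):
            if not (name.startswith("sitelist_") and name.endswith(".txt")):
                continue
            group = name[len("sitelist_"):].split("-", 1)[0]
            # Page-level groups: only collapse exact duplicate URLs.
            # Everything else: collapse down to one entry per domain.
            dedupe_file(os.path.join(folder, name), by_domain=group not in PAGE_LEVEL_GROUPS)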

  • LeeG Eating your first bourne

    Unless you're adding all links by hand, how do you know how each site is set up?

    One example, and it's an easy one, is vBulletin: you have both forum and blog posting

    There are a lot of wikis that are on subdomains of sites

    Again, blogs can be in a subdomain

     

  • AlexR Cape Town
    @LeeG - do you ever use dedupe domains, or is it always dedupe URLs you use?

    Good point above. Appreciate you taking the time to clarify. Looks like I'm only going to use dedupe URLs from now on...
  • LeeG Eating your first bourne

    Humour me on this.

    What global site lists do you save?

    I'm only asking because I have a hunch about which ones you are saving, and I will give you a good reason for what to save and what not to save

  • AlexR Cape Town
    I save the:
    identified
    submitted
    verified

    For T0's I use the submitted and verified.
    For T1's I use the identified. My goal is for the T1's to sort through the identified lists for URLs and then move them to the submitted or verified lists for T0 projects to use. Since my identified list is large, there will be many entries that never made it to the submitted list; I think when I used to use CS it solved many captchas wrongly, so I'm going through the list to recheck over the next 2 months.
  • LeeG Eating your first bourne

    Just use submitted and verified

    Even if a captcha is wrong, from my understanding, it will still go into the submitted list

  • AlexR Cape Town
    @sven - can you confirm:
    "Even if a captcha is wrong, from my understanding, it will still go into the submitted list"
  • Sven www.GSA-Online.de
    If the captcha is wrong it will not go to the submitted list, as the submission is not successful but failed, so that site will get added to the failed list instead.
  • Mmm, interesting, but with 85%-95% of captchas solved it's still useless to save the identified list, IMO
  • @Rodol - but what if you're downloading site lists and importing URLs from the Options button? Don't those get saved to Identified by default?
  • Yes, if you scrape directly from SER those lists are saved in the identified folder by default, but I don't use SER for scraping so I don't use that list at all.
  • edited April 2013
    Guys, allow me to bring this thread back, because I did a little testing. It tells me @LeeG might be wrong on remove duplicate domains, and I think it could help others better understand how the two functions work. Here goes:

    Preparation:
    1. Create a folder on the desktop, name it "sitelist", and add 4 sub-folders in it: "identified", "successful", "verified", "failed". Open SER and change the corresponding folders to these 4 just created.
    2. Under each sub-folder, create a .txt file named "sitelist_Article-Article Script.txt" (the name doesn't matter, just use the same file name under all 4 subs).
    3. Edit all 4 txt files with the same following lines:

    All this pre-work is to keep my real sitelist untouched, and fewer lines make it easier to see what the outcome really is.

    Remove duplicate URLs (all 4 files selected):
    When clicking "remove duplicate URLs", the dialog shows 8 URLs removed, and EVERY file is left with the same set of unique lines.
    That's how remove duplicate URLs works - I guess you all know (in fact I didn't figure it out for a really long time)

    Remove duplicate domains:
    Create a new txt under the verified folder, name it "sitelist_Article-BuddyPress.txt", and paste in this:
    After de-duping domains (all 5 files selected), only 1 line is deleted (http://www.almostgallery.com/forum/1.html); subdomains are taken as different domains and they do NOT get deleted

    Anyway, for those who still didn't get this, here's the conclusion:
    1. Both remove duplicate URLs & remove duplicate domains only work within each txt file separately; they won't check whether a duplicate URL/domain appears in both the BuddyPress and Drupal files, and they won't check whether the same URL/domain is in both the verified list and the success list.
    2. For de-duping, different subdomains = different domains, and SER keeps them all (see the sketch below).
    3. Folders/pages under the same subdomain/domain = duplicate domain, and SER keeps the first line of that domain.
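    For anyone who wants to see that subdomain rule in isolation, here is a quick Python sketch of the behaviour observed above (not SER's code; apart from the almostgallery.com/forum URL, the sample lines are made up for illustration):

    from urllib.parse import urlparse

    def host(url):
        if "://" not in url:
            url = "http://" + url
        return urlparse(url).netloc.lower()

    # De-dup domains keys on the full host, so a subdomain counts as a different
    # "domain", and only the first URL per host in a given file survives.
    lines = [
        "http://www.almostgallery.com/",              # kept - first line for this host
        "http://www.almostgallery.com/forum/1.html",  # removed - same host as the line above
        "http://blog.example.com/post-1",             # kept - a subdomain is a different host
        "http://www.example.com/post-2",              # kept - different host again
    ]
    seen, kept = set(), []
    for url in lines:
        if host(url) not in seen:
            seen.add(host(url))
            kept.append(url)
    print(kept)  # everything except the /forum/1.html line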
  • AlexRAlexR Cape Town
    That would be a neat feature if we could dedupe across lists, i.e. a link should only be in 1 list. If it's been verified, then it should be removed from the "Submitted" and "Identified" lists, etc. Would keep things much neater!
  • @AlexR agreed, that was how I thought this remove function worked. It would reduce a lot of "already parsed" results too.
    I also think what you brought up at the beginning of this thread should be added to SER.
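    If anyone wants to do that cross-list cleanup by hand in the meantime, here is a rough sketch of the idea (my own script idea, not an existing SER feature; it assumes the usual sitelist folders "identified", "submitted" and "verified" each hold files with the same names - adjust the folder names if yours differ, e.g. "successful" instead of "submitted"):

    import os

    def load(path):
        if not os.path.exists(path):
            return []
        with open(path, encoding="utf-8", errors="ignore") as f:
            return [line.strip() for line in f if line.strip()]

    # Drop from "identified" and "submitted" any URL that is already verified,
    # so each URL ends up in only one list.
    def cross_list_cleanup(sitelist_root):
        verified_dir = os.path.join(sitelist_root, "verified")
        for name in os.listdir(verified_dir):
            if not name.endswith(".txt"):
                continue
            verified = set(load(os.path.join(verified_dir, name)))
            for folder in ("identified", "submitted"):
                path = os.path.join(sitelist_root, folder, name)
                if not os.path.exists(path):
                    continue
                remaining = [u for u in load(path) if u not in verified]
                with open(path, "w", encoding="utf-8") as f:
                    f.write("\n".join(remaining) + "\n")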