
PHP code to remove URLs from one list that also appear in another list

boilerboiler Indonesia
edited December 2020 in GSA Search Engine Ranker
At first I looked for a way to remove the URLs that two different URL lists have in common and couldn't find one, so I decided to write it in PHP: the script removes a URL from List 1 if its domain is found in List 2, in the file with the same filename.
With this code you can make sure that no URL is duplicated across your lists if you need to run multiple URL lists in GSA SER.

To make this code work, first identify the raw URLs using GSA Platform Identifier or GSA SER, put the results into 2 different folders, and then set the two paths at the top of the script.

Feel free to modify or improve this code.
<?php
// Files to clean (List 1) and the folder to compare against (List 2).
$files = glob('D:\GSA\List\GlobalList\*.txt');
$comparepath = 'D:\GSA\List\IndonesiaList';
foreach ($files as $value)
{
    clearstatcache();
    $i = 0;          // lines of this file whose host was also found in the compare file
    $cleanurl = [];  // reset per file so URLs from earlier files are not written again
    // The compare file has the same filename, but lives in the List 2 folder.
    $comparefile = $comparepath."/".basename($value);
    if (file_exists($comparefile))
    {
        echo "Removing duplicate from ".$value." and ".$comparefile." ";
        // Read both files without trailing newlines and without blank lines.
        $contenta = file($value, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
        $contentb = file($comparefile, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
        if (!empty($contenta) && !empty($contentb))
        {
            foreach ($contenta as $cvalue)
            {
                $cvalue = trim($cvalue);
                $grhost = parse_url($cvalue, PHP_URL_HOST);
                // Lines without a parsable host are dropped from the cleaned file.
                if (!empty($grhost))
                {
                    // Escape the host so that "." only matches a literal dot.
                    $pattern = "/\b".preg_quote($grhost, "/")."\b/i";
                    if (empty(preg_grep($pattern, $contentb)))
                    {
                        // Host not found in List 2: keep the URL.
                        $cleanurl[] = $cvalue.PHP_EOL;
                    }
                    else
                    {
                        $i++;
                    }
                }
            }
            // Overwrite the List 1 file with the URLs that were kept.
            file_put_contents($value, implode($cleanurl));
        }
    }
    echo "Found ".$i." duplicate on ".$value."\n";
}
?>
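
The script above runs one regex search over the whole second list for every URL in the first list, which may get slow on big site lists. Below is a minimal sketch of an alternative that indexes the hosts of the second list once and then looks each host up directly. Note that this is a different technique from the original: it compares exact hosts (case-insensitive), so abc.com and www.abc.com count as different, while the regex in the script matches the host anywhere in a line. The function names hostOf and removeSharedHosts and the sample URLs are made up for illustration.

```php
<?php
// Hypothetical helper (not in the original script): returns the lowercased
// host of a URL line, or null when parse_url() cannot find one.
function hostOf(string $line): ?string
{
    $host = parse_url(trim($line), PHP_URL_HOST);
    return is_string($host) && $host !== '' ? strtolower($host) : null;
}

// Hypothetical helper: keeps the lines of $listA whose host does not occur
// in $listB. Returns [kept lines, number of removed lines].
function removeSharedHosts(array $listA, array $listB): array
{
    // Index the hosts of list B as array keys for fast isset() lookups.
    $hostsB = [];
    foreach ($listB as $line) {
        $host = hostOf($line);
        if ($host !== null) {
            $hostsB[$host] = true;
        }
    }

    $kept = [];
    $removed = 0;
    foreach ($listA as $line) {
        $host = hostOf($line);
        if ($host === null) {
            continue; // like the script above, lines without a host are dropped
        }
        if (isset($hostsB[$host])) {
            $removed++;
        } else {
            $kept[] = trim($line);
        }
    }
    return [$kept, $removed];
}

// Sample data (assumed): abc.com is in both lists, xyz.org only in list A.
[$kept, $removed] = removeSharedHosts(
    ['http://abc.com/page1', 'http://xyz.org/post'],
    ['https://ABC.com/other']
);
```

To use it in the script, the two file() results could be passed in as $listA and $listB and the kept lines written back to the first file.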

Comments

  • SvenSven www.GSA-Online.de
    Thanks for the code. There is also one other way to get this done quickly. GSA Search engine Ranker has this included. Go to options->Advanced->tools->Remove duplicates from file
  • boilerboiler Indonesia
    edited December 2020
    Sven said:
    Thanks for the code. There is also one other way to get this done quickly. GSA Search engine Ranker has this included. Go to options->Advanced->tools->Remove duplicates from file
    I think that only works on 1 file.
    The code I wrote compares the content of each pair of files and removes a URL from the first list if it is found in the second list. Since I'm using 3 lists in my GSA, with this code I can be sure that the same domain never appears in more than one list :)

    For example, the domain abc.com will be found only in List A, and not in List B or List C.

    Removing duplicate from D:\GSA\List\GlobalList\sitelist_Article-BuddyPress.txt and D:\GSA\List\IndonesiaList/sitelist_Article-BuddyPress.txt Found 0 duplicate on D:\GSA\List\GlobalList\sitelist_Article-BuddyPress.txt
    Removing duplicate from D:\GSA\List\GlobalList\sitelist_Article-Catalyst Web CMS.txt and D:\GSA\List\IndonesiaList/sitelist_Article-Catalyst Web CMS.txt Found 0 duplicate on D:\GSA\List\GlobalList\sitelist_Article-Catalyst Web CMS.txt
    Removing duplicate from D:\GSA\List\GlobalList\sitelist_Article-ClassiPress.txt and D:\GSA\List\IndonesiaList/sitelist_Article-ClassiPress.txt Found 0 duplicate on D:\GSA\List\GlobalList\sitelist_Article-ClassiPress.txt
    Removing duplicate from D:\GSA\List\GlobalList\sitelist_Article-Drupal - Blog.txt and D:\GSA\List\IndonesiaList/sitelist_Article-Drupal - Blog.txt Found 0 duplicate on D:\GSA\List\GlobalList\sitelist_Article-Drupal - Blog.txt
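
The abc.com example above, with three lists, can be sketched as a single pass that keeps every host only in the first list where it appears. This is only an illustration under assumptions, not the posted script: the function name keepHostInFirstList, the list names A, B and C, and the sample URLs are made up, and hosts are compared exactly (case-insensitive).

```php
<?php
// Hypothetical helper: $lists is given in priority order (name => URLs).
// Each host stays only in the first list that contains it.
function keepHostInFirstList(array $lists): array
{
    $seen = [];   // hosts already claimed by an earlier list
    $result = [];
    foreach ($lists as $name => $urls) {
        $claimed = [];          // hosts first seen in the current list
        $result[$name] = [];
        foreach ($urls as $url) {
            $host = parse_url(trim($url), PHP_URL_HOST);
            if (!is_string($host) || $host === '') {
                continue;       // skip lines without a parsable host
            }
            $host = strtolower($host);
            if (isset($seen[$host])) {
                continue;       // an earlier list already owns this host
            }
            $claimed[$host] = true;
            $result[$name][] = trim($url);
        }
        // Only now mark these hosts as taken, so several URLs of the same
        // host inside one list are all kept.
        $seen += $claimed;
    }
    return $result;
}

// Sample data (assumed).
$clean = keepHostInFirstList([
    'A' => ['http://abc.com/a', 'http://abc.com/b'],
    'B' => ['http://abc.com/c', 'http://def.net/'],
    'C' => ['http://def.net/x', 'http://ghi.org/'],
]);
```

With this sample data, abc.com stays only in list A, def.net only in list B, and ghi.org only in list C.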