
How to remove scraped links that already appear in previous scrapes via Xrumer?

mmtj is offering very helpful tips and a process. You can check it here:
https://forum.gsa-online.de/discussion/838/independent-tool-for-importing-and-sorting-urls/p1

But I have a question:
Regarding "Then we compare the scrape against the previous scrapes and previously identified lists by using xrumer":
How do you do it? I know how to do it in SB, but SB can only handle 1M links.
After a couple of days, the total of the previous scrapes will be extremely large, and SB can't handle that.

So could you tell me how you do this in Xrumer?

doubleup
If you have a very fast VPS, you can do it quickly. With more than 500 threads, the sorting is really fast.
But that kind of VPS (8G RAM, 8 CPU) costs too much, so I just chose a cheap VPS with the standard setting of 2G RAM and 2 CPU,
and let it sort for 24 hours. That is fine for me.
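
The comparison asked about above (new scrape versus all previous scrapes) can also be done outside SB or Xrumer. Below is a minimal Python sketch, not anyone's actual workflow from this thread. It assumes one URL per line, that the new scrape fits in memory, and that the previous scrapes can be arbitrarily large because they are only streamed. The function name `fresh_urls` and the file names in the usage note are hypothetical.

```python
from pathlib import Path
from typing import Iterable, Set


def normalize(line: str) -> str:
    """Strip surrounding whitespace so identical URLs compare equal."""
    return line.strip()


def fresh_urls(new_scrape: Path, previous_scrapes: Iterable[Path]) -> Set[str]:
    """Return URLs from new_scrape that occur in none of previous_scrapes."""
    # Only the new scrape is held in memory (assumption: it is small enough).
    with new_scrape.open(encoding="utf-8", errors="ignore") as f:
        fresh = {normalize(line) for line in f}
    fresh.discard("")  # ignore blank lines

    # Old scrapes are read line by line, so their total size does not matter.
    for old in previous_scrapes:
        with old.open(encoding="utf-8", errors="ignore") as f:
            for line in f:
                fresh.discard(normalize(line))
                if not fresh:
                    return fresh  # nothing left to check
    return fresh
```

Usage could look like `fresh_urls(Path("new.txt"), Path("old").glob("*.txt"))`, with the returned set then written to a file, one URL per line. Keeping the smaller side in memory and streaming the larger side is the design choice here; the running time still grows with the total size of the old scrapes.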


Comments

  • Added info.
    After a couple of days of scraping, the previous scrapes will reach more than 1000 million links, and it is quite hard to compare new scrapes against them.
    How would you do that? Thanks.
  • Well, with xRumer it's a bit limited: You could save your previous scrape as the xblack and use the tools -> delete all links according to xblack. It's a bit tricky, but works okay. You can also use the analysis tool.

    Personally, we completely switched to UltraEdit and its subtool UltraCompare. It's fast and can handle big files easily (5GB+). Simply merge your scrapes and use the remove duplicates feature.

    Here's how to remove the duplicates: http://www.ultraedit.com/support/tutorials_power_tips/ultraedit/advanced_column_based_sort.html#remove-duplicates. The Pro Version can compare files separately too without merging them. 

  • edited March 2014
    @mmtj Thanks so much for your help.
    I tried UltraEdit and UltraCompare, and removing duplicates is really fast.
    But removing duplicates is not the hardest part; the hardest part is comparing against the previous scrapes and outputting only the fresh links.

    I tried many settings in UltraCompare, but I still haven't figured it out.
    The compare feature is easy to use: just select the new scrape and the old scrape, click compare, select only show difference, and save the result.
    Here is the problem: the saved file always looks like this.

    93280    *     http://outgoingchimpanzee.blogspot.com/
    93282    <!    http://outgoingcoal.blogspot.com/
    93283    <!    http://outgoingdesign.blogspot.com/
    93286    *     http://outgoingmanager.blogspot.com/
    93287    *     http://outgoingnew.blogspot.com/
    93288    *     http://outgoingstrobilanthes.blogspot.com/

    I only need the links, but the result always contains the line number, spaces, and * or <! .

    Do you have the same problem? How do you work around it?
    Or do you have a completely different way to do the comparison and save the difference (the fresh, unique scrapes)?
    Thanks for your help.
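
One way to clean up a saved compare result like the one quoted above is to post-process it with a short script. The sketch below is only an illustration: it assumes every useful line has the shape "line number, marker, URL" seen in the sample, and it makes no claim about what the `*` and `<!` markers mean in UltraCompare. The names `parse_compare_lines` and `urls_only` are hypothetical.

```python
import re
from typing import Iterable, Iterator, List, Optional, Tuple

# Assumed line shape, taken from the sample output: number, marker, URL.
COMPARE_LINE = re.compile(r"^\s*\d+\s+(\S+)\s+(https?://\S+)\s*$")


def parse_compare_lines(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (marker, url) for each line that matches the assumed shape."""
    for line in lines:
        match = COMPARE_LINE.match(line)
        if match:
            yield match.group(1), match.group(2)


def urls_only(lines: Iterable[str], marker: Optional[str] = None) -> List[str]:
    """Return bare URLs; if marker is given, keep only lines with that marker."""
    return [
        url
        for mark, url in parse_compare_lines(lines)
        if marker is None or mark == marker
    ]


sample = [
    "    93280    *     http://outgoingchimpanzee.blogspot.com/",
    "    93282    <!    http://outgoingcoal.blogspot.com/",
    "    93286    *     http://outgoingmanager.blogspot.com/",
]
print("\n".join(urls_only(sample)))
```

Passing `marker="<!"` (or `"*"`) keeps only the lines carrying that marker, which may help once it is clear which marker corresponds to links found only in the new scrape.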