
Feature suggestion

Since we would like to automate everything, I am missing the following things.
When I finish harvesting URLs from ScrapeBox, the first thing I want to do is remove the duplicates from the file. If I use a monitored folder for dup removal, I would like to set an output folder where PI writes the deduped files. This output folder is important because we don't want PI to start identifying the raw harvested URLs, only the deduped ones (which are a very small percentage of the raw file).
In that case I would set up an identify project that monitors the deduped output folder. The identified URLs then go to their set folder and the unrecognized ones go to theirs.
It would also be good if SER could monitor the unrecognized URL folder and start trying to post to those URLs, so I don't have to import them into a project manually. The other option is to set a size limit for the unidentified file and then start writing a new one, so I don't lose track of which URL was the last one I imported into a project. Otherwise it will import all the URLs from the beginning.
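
To make this concrete, the dedup-and-route step I am asking for would behave roughly like the sketch below (the folder paths are only examples, and this is just an illustration of the behaviour, not how PI works internally):

    # Sketch: dedup raw ScrapeBox exports into a separate output folder,
    # so the identify project only ever sees the deduped files.
    # Folder paths are examples only.
    import os

    RAW_DIR = r"C:\scrapebox\harvested"    # raw harvested URL files (example)
    DEDUPED_DIR = r"C:\scrapebox\deduped"  # the folder PI should monitor (example)

    os.makedirs(DEDUPED_DIR, exist_ok=True)

    for name in os.listdir(RAW_DIR):
        if not name.lower().endswith(".txt"):
            continue

        seen = set()
        unique = []
        with open(os.path.join(RAW_DIR, name), "r", encoding="utf-8", errors="ignore") as src:
            for line in src:
                url = line.strip()
                if url and url not in seen:
                    seen.add(url)
                    unique.append(url)

        # Only the (much smaller) deduped list lands where PI is watching.
        with open(os.path.join(DEDUPED_DIR, name), "w", encoding="utf-8") as dst:
            dst.write("\n".join(unique) + "\n")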

Thank you


Comments

  • s4nt0s Houston, Texas
    Ok, we'll be adding these extra dedup options over the next few days:

    1) Save deduped files to another dir
    2) Delete original files after dedup
    3) When saving, append instead of overwriting files

    Hopefully that should help with your issue :)
  • @s4nt0s That's fantastic.
    So:

    1) It will save to another file.
    2) I don't think deleting the originals is a good idea; in case we want to use the file again, or the saved file gets corrupted, we at least have a backup for a few days.
    3) Appending to an existing save is also good, as long as PI knows which URL was the last one.

    Do you have any solution for importing the unidentified file automatically?

    Thank you
  • s4nt0s Houston, Texas
    All of those are options; you will be able to enable/disable them as you wish.

    For importing the unidentified URLs automatically, why not set that folder as a global site list in SER so it pulls from it at intervals?
  • @s4nt0s

    I tried it already, but the file name is different from what a site list uses. It didn't pull anything.
  • s4nt0s Houston, Texas
    Did you try switching the setting? You can change how the file name is saved between SER and PI.

    If you go to SER, you can click the big options button > advanced tab > "file format". Switch it to the other one and see if it pulls in URLs.
  • I didn't change the setting, because SER is writing the verifieds in a different format, and then I'd have to move all my URLs over to that format... kinda busy, too lazy to move all the URLs :)
  • @s4nt0s
    Hi, I see the dedup option with a separate save folder has been implemented! You guys rock!

    Thank you very much!!!!
  • s4nt0s Houston, Texas
    @sysco32 - Haha, ya we pushed that out in the last update. Glad you like it :)
  • @s4nt0s

    I have another suggestion. More like a question. My jobs have been running for a while now. I save the unidentified URLs as well, as a lot of good links end up there.
    So as long as SER can't monitor that folder like a global site list (I also tried changing it to name only), I need to import the file manually, which is not a problem.
    The problem starts here: the file is currently over 1.5 GB. My VPS is strong, but it was struggling to import that amount of URLs into a project and I ended up restarting the VPS.

    So is there any option to apply a size/URL/line limit to the unidentified file, so that when it is reached PI starts a new one?
    Let's say 200 MB, or even better if we could set the limit ourselves.
    So we would have
    unidentified.txt - 200 MB
    and the new one would start as
    unidentified01.txt

    It would also help with space management; once we're finished with the files, we can delete the ones we don't need. (A rough sketch of the splitting I mean is at the end of this comment.)

    Thank you
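
    Just to illustrate the rotation I mean, here is a rough sketch done after the fact on an existing oversized file (the 200 MB limit, the path and the unidentified00.txt / unidentified01.txt names are only examples, not anything PI does today):

        # Sketch: split a huge unidentified.txt into ~200 MB parts
        # so each part can be imported into a project on its own.
        import os

        SOURCE = r"C:\PI\unidentified\unidentified.txt"  # example path
        LIMIT = 200 * 1024 * 1024                        # ~200 MB per part
        base = os.path.splitext(SOURCE)[0]

        part = 0
        written = 0
        out = open(f"{base}{part:02d}.txt", "w", encoding="utf-8")

        with open(SOURCE, "r", encoding="utf-8", errors="ignore") as src:
            for line in src:
                if written >= LIMIT:             # limit reached: start a new part
                    out.close()
                    part += 1
                    written = 0
                    out = open(f"{base}{part:02d}.txt", "w", encoding="utf-8")
                out.write(line)
                written += len(line.encode("utf-8"))

        out.close()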


  • s4nt0s Houston, Texas
    Thanks for the feature suggestion. We'll consider adding this.

    You can also try running a dup remove project on the unidentified URLs so it removes dups in that folder every X amount of time. I'm assuming your 1.5 GB file had a lot of dups.

    We'll see about adding this feature though.
  • The problem is that it doesn't have any dups. I remove the dups before sending them to PI, so it is all unique.
  • Ah, OK... I misunderstood it. Yes, it is possible to have dups. I will take a look at it.