Feature suggestion
sysco32
Skopje
As we would like to automate everything, I am missing the following things.
When I finish harvesting URLs with ScrapeBox,
first I would like to remove the duplicates from the file. If I use the monitor-folder-for-dupes feature, I would like to set a folder where PI writes out the deduped files. This output folder is important because we don't want PI to start identifying the raw harvested URLs, only the deduped ones (which are a very small percentage of the raw file).
In that case I would set up an identify project that monitors the deduped output folder. The identified URLs would then go to their set folder, and the unrecognized ones would go to theirs.
It would also be good if SER could monitor the unrecognized-URL folder and start trying to post to those URLs, so I don't have to manually import the URLs into a project. The other option is to set a size limit for the unidentified file; once it is reached, PI would start writing a new one, and I would not lose track of which was the last URL I imported into a project. Otherwise it will import all the URLs from the beginning.
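The dedupe step described above can be sketched in a few lines. This is only an illustration of the requested behavior, not PI's actual implementation; the file and folder names are assumptions.

```python
# Sketch: read a raw harvested-URL file, drop duplicate lines while
# preserving order, and write the result into a separate output folder
# so that only the deduped file gets picked up for identification.
from pathlib import Path

def dedupe_file(raw_path: Path, out_dir: Path) -> Path:
    seen = set()
    deduped = []
    for line in raw_path.read_text(encoding="utf-8", errors="ignore").splitlines():
        url = line.strip()
        if url and url not in seen:
            seen.add(url)
            deduped.append(url)
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / raw_path.name  # same name, different (deduped) folder
    out_path.write_text("\n".join(deduped) + "\n", encoding="utf-8")
    return out_path
```

Keeping the output in a separate folder is the key point: the identify project only watches the deduped folder, so the much larger raw file is never processed.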
Thank you
Comments
So
1. it will save to another file,
2. I don't think it is a good idea to make it delete the file; in case we want to use the file again, or the saved file gets corrupted, we at least have a backup for a few days,
3. appending to an existing save is also good, as long as PI knows which one was the last URL.
Do you have any solution for importing the unidentified file automatically?
Thank you
I tried it already, but the file name is different from the one in a site list. It didn't pull anything.
Hi, I see the dedupe option with a different save folder implemented! You guys rock!
Thank you very much!
I have another suggestion, or more of a question. My jobs have been running for a while now. I save the unidentified URLs as well, as a lot of good links end up there.
So as long as SER can't monitor the folder like a global site list (I also tried changing the file name to name only), I need to import the file manually, which is not a problem.
The problem starts here: the file is currently over 1.5 GB. My VPS is strong, but it was struggling to import this amount of URLs into a project, and I ended up restarting the VPS.
So is there any option to apply a size/URL/line limit to the unidentified file, so that when it is reached, PI starts a new one?
Let's say 200 MB, or even better if we could set the limit ourselves.
So we would have
unidentified.txt - 200 MB
and the new one would start like
unidentified01.txt
It would also be beneficial for space management: as we finish with the files, we can delete the ones we don't need.
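The rollover scheme suggested above could look something like this. The 200 MB limit and the unidentified01.txt naming follow the suggestion; treat the whole thing as a hypothetical sketch, not PI's real behavior.

```python
# Sketch: append URLs to unidentified.txt until a configurable byte
# limit is reached, then roll over to unidentified01.txt,
# unidentified02.txt, and so on.
from pathlib import Path

LIMIT = 200 * 1024 * 1024  # 200 MB; ideally a user-configurable setting

def current_target(folder: Path, limit: int = LIMIT) -> Path:
    """Return the first unidentified file that is still under the limit."""
    n = 0
    while True:
        name = "unidentified.txt" if n == 0 else f"unidentified{n:02d}.txt"
        path = folder / name
        if not path.exists() or path.stat().st_size < limit:
            return path
        n += 1

def append_url(folder: Path, url: str, limit: int = LIMIT) -> Path:
    """Append one URL, rolling over to a new file once the limit is hit."""
    folder.mkdir(parents=True, exist_ok=True)
    target = current_target(folder, limit)
    with target.open("a", encoding="utf-8") as f:
        f.write(url + "\n")
    return target
```

Because each file is capped, a finished file can be imported into a project once and then deleted, which is exactly the space-management benefit described above.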
Thank you