A Few Questions?
1. Will it be OK to run multiple projects at the same time that write their identified URLs into the same folder?
2. Will it be OK to run a Remove Duplicates project on a folder while another project is identifying URLs and saving them into that same folder at the same time? (See the sketch after this list for why I'm asking.)
3. Will it be OK to run Remove Duplicates on a folder in which Scrapebox saves the harvested URLs via the Automator plugin (NOT Scrapebox's harvested_sessions folder), with both running at the same time?
4. I'm running a monitoring project on a folder where Scrapebox saves the URLs. The file in that folder has around 74k URLs, yet the monitoring project shows around 115k processed URLs and it's still counting. Can you explain what's going on there? (Three more projects are doing the same thing.)
5. A best-practice question: is it better to use Scrapebox's harvested_sessions folder and process all URLs through one project (all engines selected), OR to harvest with Scrapebox engine by engine (currently only articles and wiki) and then create a separate project in PI for each engine?
I have a few more, but I'd like to get these sorted out first.
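On questions 2 and 3, here's a rough sketch of the kind of read-modify-write clash I'm worried about: a dedupe pass reads a URL file, drops duplicates, and rewrites the file, so anything another project appends in between could be lost. This is purely an illustration, not how Platform Identifier or the Scrapebox Automator actually behave; the folder path, the .txt filter, and the 60-second "quiet file" threshold are all assumptions I made up for the example.

```python
# Hypothetical illustration of the concurrency concern in questions 2-3.
# A dedupe pass is a read-modify-write over each URL file, so URLs appended
# by another process between the read and the rewrite can be lost. This
# standalone sketch sidesteps that by skipping files modified very recently.
import os
import time

URL_FOLDER = r"C:\scrapebox\identified_urls"  # assumed path, adjust to your setup
QUIET_SECONDS = 60  # only touch files that have not changed for this long

def dedupe_file(path: str) -> None:
    """Rewrite a URL file with duplicate lines removed, keeping original order."""
    with open(path, "r", encoding="utf-8", errors="ignore") as f:
        lines = f.read().splitlines()
    seen = set()
    unique = []
    for url in lines:
        url = url.strip()
        if url and url not in seen:
            seen.add(url)
            unique.append(url)
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(unique) + "\n")

def dedupe_folder(folder: str) -> None:
    now = time.time()
    for name in os.listdir(folder):
        path = os.path.join(folder, name)
        if not name.lower().endswith(".txt") or not os.path.isfile(path):
            continue
        if now - os.path.getmtime(path) < QUIET_SECONDS:
            continue  # another process may still be writing to it; skip for now
        dedupe_file(path)

if __name__ == "__main__":
    dedupe_folder(URL_FOLDER)
```

Skipping recently modified files is just one way a standalone tool might avoid stepping on a file that's still being written to; whether Platform Identifier and Scrapebox handle this safely on their own is exactly what I'm asking.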
Comments
Maybe I should process the unidentified URLs separately again, I dunno... What do you think?
It's amazing how well things combine with Platform Identifier! Scraping seemed too complicated and not worth it at first, but Platform Identifier made me want to love scraping!
First, I share the opinion of a few pros here that Moz metrics can easily be spammed, because I've seen many such sites. Despite that, I still think it's the most accurate metric currently available, and I definitely plan to try building a list using a DA/PA filter to see how it goes.
Second, I don't have a Moz Pro account. Well, actually I do have one, but it's expiring soon and I won't be renewing it. Can a free Moz account be used to check that many links? I thought free Moz accounts had some checking limits?