Encoding Question: Sitelists From ScrapeBox V2?
I've tried feeding sitelists from ScrapeBox v2 directly into SER, but encoding issues are messing up my whole list.
I'm using an automated setup to scrape with ScrapeBox v2, using keyword lists in all kinds of languages. ScrapeBox seems to support them all well. There were a few that weren't handled, some Korean I think, but after I contacted their support they fixed that as well. I can see their exported files use UCS-2 Little Endian.
However, I've noticed that GSA SER saves some lists as ANSI and others as UTF-8; I'm not sure why they're mixed. In any case, some languages like Arabic, Korean and Chinese don't come through intact even with UTF-8.
Even when I try to convert the lists from UCS-2 Little Endian to UTF-8, URLs containing Arabic or Korean characters come out like ????-???-????
So I'm wondering how to get around this. How should I convert UCS-2 Little Endian lists to UTF-8 to feed into SER without losing the links that contain Korean, Arabic and other non-Latin characters?
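For what it's worth, the conversion step itself is lossless; "?" substitution only appears when text is forced through a single-byte (ANSI) codec. Here's a minimal sketch of the straight conversion in Python (just an illustration, not the thread's actual script; the file names are placeholders):

```python
# Sketch: convert a ScrapeBox export (UCS-2 Little Endian, effectively
# UTF-16-LE with a BOM) to UTF-8 without corrupting non-Latin URLs.
# "sitelist_ucs2.txt" and "sitelist_utf8.txt" are placeholder names.

def convert_ucs2le_to_utf8(src_path: str, dst_path: str) -> None:
    # The "utf-16" codec honours the BOM that UCS-2 LE exports start with.
    # Decoding here never produces "?" characters; that only happens when
    # the text is round-tripped through a lossy ANSI code page.
    with open(src_path, "r", encoding="utf-16") as src:
        text = src.read()
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.write(text)
```

If question marks still appear after a conversion like this, the loss happened in whatever tool opened or re-saved the file, not in UTF-8 itself, since UTF-8 can represent every character UCS-2 can.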
Comments
But if I go with ANSI, a lot of the links containing Korean, Arabic or any other non-ANSI letters will become ??????. What can I do about that? Does it even matter in the end?
So the best way for ScrapeBox to save them would be ANSI, but URL-encoded. Is that right?
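To illustrate why that works: percent-encoding replaces every non-ASCII character with %XX escapes of its UTF-8 bytes, so the resulting URL is pure ASCII and an ANSI save can't mangle it. A quick Python sketch of the idea (the thread's actual script is C#; `example.com` is a placeholder):

```python
from urllib.parse import quote

# Percent-encode a URL so it contains only ASCII characters.
# safe=":/" keeps the scheme and path separators intact while
# everything non-ASCII becomes %XX escapes of its UTF-8 bytes.
def to_ascii_url(url: str) -> str:
    return quote(url, safe=":/")

encoded = to_ascii_url("http://example.com/테스트")
assert encoded.isascii()  # safe to store in an ANSI file
```

So "ANSI but URL-encoded" is really just "ASCII": once encoded, the file content is the same bytes in ANSI, UTF-8 or ASCII.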
I'll be using C# for the job, and I've found two different methods for it:
HttpUtility.UrlPathEncode
and
HttpUtility.UrlEncode
Apparently the most important difference is how they treat spaces: UrlPathEncode converts a space to "%20", while UrlEncode converts it to "+". "%20" seems to be the universally accepted encoding for a space, but for some reason Microsoft suggests using UrlEncode instead.
So I was wondering: which one will GSA SER accept as a space?
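The same split exists in other languages, which makes it easy to see the difference. In Python, `quote` behaves like UrlPathEncode for spaces and `quote_plus` like UrlEncode (this is just an illustration of the two conventions, not a claim about what SER accepts):

```python
from urllib.parse import quote, quote_plus

# quote encodes a space as "%20", like HttpUtility.UrlPathEncode.
# quote_plus encodes it as "+", like HttpUtility.UrlEncode.
path = "DATABASE and SQL"
print(quote(path))       # DATABASE%20and%20SQL
print(quote_plus(path))  # DATABASE+and+SQL
```

One relevant detail: "+" only means "space" inside a query string (the application/x-www-form-urlencoded convention); in a URL path it's a literal plus sign. Since sitelist URLs have spaces in their paths, "%20" is the safer choice there.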
It's not that I can't code it, but I like to depend on as few outside scripts as possible; things get messy when too many of them are used. So I hope SER will be able to accept and convert Unicode files internally some day.
I already coded something. I went with UrlPathEncode and it seems to be producing good results. I'd appreciate a second opinion as a final check. These are the original URLs:
http://jettyplugin.googlecode.com/svn/trunk/reference/DATABASE and SQL/数据库中乐观锁和悲观锁.mht
http://itthadclub.com/node/تأجيل-مباراة-الاتحاد-والداخلية-بسبب-وفاة-يوسف-محي
This is how my script would convert them:
http://jettyplugin.googlecode.com/svn/trunk/reference/DATABASE and SQL/数据库中乐观锁和悲观锁.mht
http://itthadclub.com/node/تأجيل-مباراة-الاتحاد-والداخلية-بسبب-وفاة-يوسف-محي
I have no idea how to test if GSA SER will accept them or not, but can you just take a glance and tell me if I'm on the right track?
EDIT: It seems the forum converts them back automatically and displays the Unicode, so I guess that's a sign I'm all right.
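A way to sanity-check a converted list without involving the forum (or SER) is to assert that every encoded URL is pure ASCII and that percent-decoding recovers the original exactly. A Python sketch of that check, using one of the URLs above:

```python
from urllib.parse import quote, unquote

# Sanity check for an encoded sitelist entry: the encoded form must be
# pure ASCII (so an ANSI save can't corrupt it), and decoding it must
# recover the original URL exactly (so nothing was lost).
original = ("http://itthadclub.com/node/"
            "تأجيل-مباراة-الاتحاد-والداخلية-بسبب-وفاة-يوسف-محي")
encoded = quote(original, safe=":/")
assert encoded.isascii()
assert unquote(encoded) == original
```

Running the same two assertions over every line of the converted file would confirm the whole list survived the round trip.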
https://www.dropbox.com/s/flvbut84a7c0tm0/UrlEncodeSample.txt?dl=0 - Have fun with it
I finished my script; it was very fast with the method I mentioned, so there's no need to get into this just for me. But I, along with everyone else, would certainly appreciate it if you could include Unicode conversion in SER.
Not sure how to get over this.
http://stackoverflow.com/questions/29635490/how-to-convert-unicode-text-file-with-urls-to-ansi-using-url-encoding
Seems to be working well. All I need now is a real-world example. Let's see how it goes with SER.