Encoding Question: Sitelists From ScrapeBox V2?
I've tried feeding sitelists from ScrapeBox v2 directly into SER, but encoding issues are messing up my whole list.
I'm using an automated setup to scrape with ScrapeBox v2, using keyword lists in all kinds of languages. ScrapeBox seems to support them all well. There were a few that weren't handled, some Korean I think, but after I contacted their support they fixed that as well. I can see their exported files use UCS-2 Little Endian.
However, I've noticed that GSA SER saves some lists as ANSI and others as UTF-8; I'm not sure why they're mixed. In any case, some languages like Arabic, Korean and Chinese don't come through intact even with UTF-8.
Even when I try to convert the lists from UCS-2 Little Endian to UTF-8, URLs containing Arabic or Korean characters come out like ????-???-????
So I'm wondering how to get around this. How should I convert UCS-2 Little Endian lists to UTF-8 to feed into SER without losing the links that contain Korean, Arabic and other non-Latin characters?
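For what it's worth, the conversion step itself is lossless; "?" substitution only appears when text is forced through a single-byte (ANSI) codec. Here's a minimal sketch of the straight conversion in Python (just an illustration, not the thread's actual script; the file names are placeholders):

```python
# Sketch: convert a ScrapeBox export (UCS-2 Little Endian, effectively
# UTF-16-LE with a BOM) to UTF-8 without corrupting non-Latin URLs.
# "sitelist_ucs2.txt" and "sitelist_utf8.txt" are placeholder names.

def convert_ucs2le_to_utf8(src_path: str, dst_path: str) -> None:
    # The "utf-16" codec honours the BOM that UCS-2 LE exports start with.
    # Decoding here never produces "?" characters; that only happens when
    # the text is round-tripped through a lossy ANSI code page.
    with open(src_path, "r", encoding="utf-16") as src:
        text = src.read()
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.write(text)
```

If question marks still appear after a conversion like this, the loss happened in whatever tool opened or re-saved the file, not in UTF-8 itself, since UTF-8 can represent every character UCS-2 can.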
Comments
But if I go with ANSI, a lot of the links containing Korean, Arabic or any other non-ANSI letters will become ??????. What can I do about that? Does it even matter in the end?
So the best way for ScrapeBox to save them would be ANSI, but URL-encoded. Is that right?
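To illustrate why that works: percent-encoding replaces every non-ASCII character with %XX escapes of its UTF-8 bytes, so the resulting URL is pure ASCII and an ANSI save can't mangle it. A quick Python sketch of the idea (the thread's actual script is C#; `example.com` is a placeholder):

```python
from urllib.parse import quote

# Percent-encode a URL so it contains only ASCII characters.
# safe=":/" keeps the scheme and path separators intact while
# everything non-ASCII becomes %XX escapes of its UTF-8 bytes.
def to_ascii_url(url: str) -> str:
    return quote(url, safe=":/")

encoded = to_ascii_url("http://example.com/테스트")
assert encoded.isascii()  # safe to store in an ANSI file
```

So "ANSI but URL-encoded" is really just "ASCII": once encoded, the file content is the same bytes in ANSI, UTF-8 or ASCII.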
I'll be using C# for the job, and I've found two different methods for it:
HttpUtility.UrlPathEncode
and
HttpUtility.UrlEncode
Apparently the most important difference is how they treat spaces: UrlPathEncode converts a space to "%20", while UrlEncode converts it to "+". "%20" seems to be the universally accepted encoding for a space, but for some reason Microsoft suggests using UrlEncode instead.
So I was wondering: which one will GSA SER accept as a space?
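The same split exists in other languages, which makes it easy to see the difference. In Python, `quote` behaves like UrlPathEncode for spaces and `quote_plus` like UrlEncode (this is just an illustration of the two conventions, not a claim about what SER accepts):

```python
from urllib.parse import quote, quote_plus

# quote encodes a space as "%20", like HttpUtility.UrlPathEncode.
# quote_plus encodes it as "+", like HttpUtility.UrlEncode.
path = "DATABASE and SQL"
print(quote(path))       # DATABASE%20and%20SQL
print(quote_plus(path))  # DATABASE+and+SQL
```

One relevant detail: "+" only means "space" inside a query string (the application/x-www-form-urlencoded convention); in a URL path it's a literal plus sign. Since sitelist URLs have spaces in their paths, "%20" is the safer choice there.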
It's not that I can't code it, but I like to depend on as few outside scripts as possible; things get messy when too many of them are used. So I hope SER will be able to accept and convert Unicode files internally some day.
I already coded something. I went with UrlPathEncode and it seems to be producing good results. I'd appreciate a second opinion as a final check. These are the original URLs:
http://jettyplugin.googlecode.com/svn/trunk/reference/DATABASE and SQL/数据库中乐观锁和悲观锁.mht
http://itthadclub.com/node/تأجيل-مباراة-الاتحاد-والداخلية-بسبب-وفاة-يوسف-محي
This is how my script would convert them:
http://jettyplugin.googlecode.com/svn/trunk/reference/DATABASE and SQL/数据库中乐观锁和悲观锁.mht
http://itthadclub.com/node/تأجيل-مباراة-الاتحاد-والداخلية-بسبب-وفاة-يوسف-محي
I have no idea how to test if GSA SER will accept them or not, but can you just take a glance and tell me if I'm on the right track?
EDIT: It seems the forum converts them back automatically and displays the Unicode, so I guess that's a sign I'm all right.
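A way to sanity-check a converted list without involving the forum (or SER) is to assert that every encoded URL is pure ASCII and that percent-decoding recovers the original exactly. A Python sketch of that check, using one of the URLs above:

```python
from urllib.parse import quote, unquote

# Sanity check for an encoded sitelist entry: the encoded form must be
# pure ASCII (so an ANSI save can't corrupt it), and decoding it must
# recover the original URL exactly (so nothing was lost).
original = ("http://itthadclub.com/node/"
            "تأجيل-مباراة-الاتحاد-والداخلية-بسبب-وفاة-يوسف-محي")
encoded = quote(original, safe=":/")
assert encoded.isascii()
assert unquote(encoded) == original
```

Running the same two assertions over every line of the converted file would confirm the whole list survived the round trip.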
https://www.dropbox.com/s/flvbut84a7c0tm0/UrlEncodeSample.txt?dl=0 - Have fun with it
I finished my script; it was very fast with the method I mentioned, so there's no need to get into this just for me. But I, along with everyone else, would certainly appreciate it if you could include Unicode conversion in SER.
Not sure how to get over this.
http://stackoverflow.com/questions/29635490/how-to-convert-unicode-text-file-with-urls-to-ansi-using-url-encoding
Seems to be working well. All I need now is a real-world example. Let's see how it goes with SER.