... I don't know how this could be approached; it would take semantic dictionaries and some sort of AI to do the job, like a search engine...
So you plug in two text corpora and calculate a similarity score between them, ranging from 0 to 1. This can be approached with NLP (Natural Language Processing), e.g. a BERT model (https://en.wikipedia.org/wiki/BERT_(language_model)) trained with a special loss function and some task-specific tweaks. If you need to get AI going for SEO purposes, I am happy to share ideas and team up to launch a service such as an API, for example on https://rapidapi.com/.
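As a rough illustration of such a corpus-to-corpus score, here is a minimal sketch assuming the sentence-transformers library and a generic pretrained model (the model name is a common public choice, not the custom-trained model proposed above):

```python
# Minimal sketch: similarity score between two corpora with a BERT-style
# encoder. The model is an off-the-shelf assumption for illustration.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

corpus_a = "Content of the first page ..."
corpus_b = "Content of the second page ..."

# Encode both corpora into fixed-size embeddings.
emb_a, emb_b = model.encode([corpus_a, corpus_b], convert_to_tensor=True)

# Cosine similarity lands roughly in [0, 1] for typical text pairs.
score = util.cos_sim(emb_a, emb_b).item()
print(f"similarity: {score:.3f}")
```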
If Sven is interested in lifting the captcha-breaker game to AI level, I could help too. It is a combination of CNN (Convolutional Neural Network) and RNN (Recurrent Neural Network) networks, set up behind it as the AI component. The user only needs to collect enough labeled samples (annotations), roughly 50-100 per captcha type, and press a button. A new net would automatically be trained on that captcha type with high accuracy, even on hard captchas. Sven wouldn't implement a Python interpreter in the GSA products, so I think the only way to go is a microservice approach via API.
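As a rough sketch of the CNN+RNN combination described here, assuming Keras, a fixed-size captcha image, and CTC-style per-timestep outputs (the actual architecture and training loop would certainly differ):

```python
# Sketch of a CNN+RNN captcha recognizer. Image size and alphabet are
# placeholder assumptions; real training would add CTC loss and augmentation.
from tensorflow.keras import layers, models

NUM_CHARS = 36          # a-z plus 0-9, an assumption
IMG_H, IMG_W = 50, 200  # example captcha dimensions

inputs = layers.Input(shape=(IMG_H, IMG_W, 1))

# CNN: extract visual features from the captcha image.
x = layers.Conv2D(32, 3, padding="same", activation="relu")(inputs)
x = layers.MaxPooling2D(2)(x)
x = layers.Conv2D(64, 3, padding="same", activation="relu")(x)
x = layers.MaxPooling2D(2)(x)

# Turn the width axis into a sequence of feature vectors.
_, h, w, c = x.shape
x = layers.Permute((2, 1, 3))(x)   # width becomes the time axis
x = layers.Reshape((w, h * c))(x)

# RNN: read the feature sequence left to right (and back).
x = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(x)

# One softmax per timestep; with CTC loss this decodes to the captcha text.
outputs = layers.Dense(NUM_CHARS + 1, activation="softmax")(x)  # +1 = CTC blank

model = models.Model(inputs, outputs)
model.summary()
```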
The idea seems nice, though I still need to understand how to achieve it.
I was talking about semantic dictionaries because it would have been possible to find (by correspondence) the dictionary that has the most similarity with a text, and therefore to know what it is talking about.
I do not know if this matches the initial request, but the benefits would be multiple and well suited to software like GSA Content Generator, for example.
For GSA Keyword Research, this could make it possible to build a semantic graph, which could be useful for creating a silo or for studying the tree structure of a site.
Semantic Cocoon:
Some interesting links:
https://gephi.org/
https://medialab.sciencespo.fr/en/tools/navicrawler/
Hi Kaine,
my suggestion would be to compute a similarity score from a BERT model between all corpora (a corpus being, for example, the content of a page), and to find out what each corpus is talking about, run topic modelling (https://en.wikipedia.org/wiki/Topic_model) on every corpus, too.
If there is interest in building such an API or an implementation, let me know.
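To illustrate the topic-modelling half of this suggestion, a small sketch with gensim's LDA; the pages, preprocessing, and topic count are placeholder assumptions:

```python
# Hedged sketch of topic modelling per corpus with gensim's LDA.
from gensim import corpora, models
from gensim.utils import simple_preprocess

pages = [
    "content of page one about link building and backlinks ...",
    "content of page two about keyword research and serps ...",
]

# Tokenize each page and build the dictionary / bag-of-words corpus.
tokens = [simple_preprocess(p) for p in pages]
dictionary = corpora.Dictionary(tokens)
bow = [dictionary.doc2bow(t) for t in tokens]

# Fit a small LDA model and print the discovered topics.
lda = models.LdaModel(bow, num_topics=2, id2word=dictionary, passes=10)
for topic_id, words in lda.print_topics():
    print(topic_id, words)
```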
Nice tool, this could be done too. BERT, like every neural network, has an n-dimensional representation (n can be chosen) in the middle of the network. These n dimensions can be reduced with dimensionality-reduction algorithms like t-SNE or PCA to get a 2-dimensional representation of an input, e.g. a word, for graphical analysis. The distance in this 2-D space, or in the n-D space, has a meaning: if you calculate Euclidean distance, Manhattan distance, or any L-norm in this space, you can interpret it as a distance in word "meaning". Similar words sit next to each other.
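A short sketch of that reduction step, assuming sentence-transformers embeddings and scikit-learn's t-SNE; the word list is a placeholder:

```python
# Project word embeddings to 2-D with t-SNE for a semantic plot/graph.
# The embedding model is an assumption; any encoder would do.
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from sentence_transformers import SentenceTransformer

words = ["seo", "backlink", "keyword", "recipe", "cooking", "oven"]
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(words)  # the n-D representation

# Reduce n dimensions to 2 for plotting; perplexity must be < number of points.
coords = TSNE(n_components=2, perplexity=3, random_state=0).fit_transform(embeddings)

plt.scatter(coords[:, 0], coords[:, 1])
for (x, y), w in zip(coords, words):
    plt.annotate(w, (x, y))  # similar words should land near each other
plt.show()
```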
Latest update will create reports in HTML with all kinds of data that your SEO customers might need/want.
It's not perfect but slowly getting there. Let me know what you think.
It still adds value but I don't think I would use it in the short term. It is rather the "Keywords/Content" aspect that attracts me to this software.
I take this opportunity to respond to @TOPtActics, since it all ties together: everything that relates to content, to understanding it, optimizing it, and even generating it, is the most vital area of present and future SEO, and every effort in this direction is worthwhile. The more the search engines develop their AI, the more flawless our content will have to be. Since I launched our indexing service, I have seen tons of different strategies, and I think I can tell you that the age of spun text should soon come to an end. Detection has gotten much better since the last update.
You will therefore have to adapt: write your own texts, or outsource this part to writers if something new does not come along quickly enough.
I coded a multi-categorizer with multi-language support some time ago; maybe this is interesting for you?
You put in any text you like, in any language you like (de, en, fr, pl, es, etc.), and you give it any categories you like (maybe your site categories), in any language you like, and the model will output, for each category, the probability that the text belongs to it.
Here is an example:
https://www.nytimes.com/2020/08/18/dining/black-jam-makers.html -- Food
I pasted the first paragraph in, pasted the original NYT categories in, and the model got it right.
Another example:
https://www.nytimes.com/2020/08/28/technology/microsoft-tiktok-lobbying.html -- Tech
Tech, Business, Politics & USA > 50% --> For me it seems the model got it right!
The special thing here is that normally you have to train a new neural net for every new label set; here I can put in whatever labels I like, and the model will do its best to find the text's similarity to each label (a category, in this example). This is totally new in the game! One model to serve all your categorization needs.
For example, I can just paste a random new category I think of into my category space, like "app", and the model will handle it:
It's a NN with a special kind of training, architecture, and cost function. The network puts each label into a sentence called the hypothesis_template, compares the distance between the input sentence and the hypothesis_template, and calculates the probabilities from that. The hypothesis_template is something like --> 'This text is about {}.', but it can be modified for better accuracy. The base hypothesis_template works pretty well out of the box.
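The mechanism described matches zero-shot classification with an NLI-style model; here is a minimal sketch with the Hugging Face pipeline, assuming a public model rather than the author's own:

```python
# Zero-shot categorization sketch: one NLI-style model, arbitrary labels.
# The model name is a common public choice, assumed for illustration.
from transformers import pipeline

classifier = pipeline("zero-shot-classification",
                      model="facebook/bart-large-mnli")

text = "Microsoft is lobbying to acquire TikTok's US operations."
labels = ["tech", "business", "politics", "food", "app"]

# Each label is inserted into the hypothesis template and scored against the text.
result = classifier(text, labels,
                    hypothesis_template="This text is about {}.",
                    multi_label=True)
for label, score in zip(result["labels"], result["scores"]):
    print(f"{label}: {score:.2f}")
```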
I'm now done coding a little test API (nothing for much load, no Redis etc.; maybe it can handle 200 requests/min). I'm on holiday next week; maybe I'll send you guys a link the week after. This screenshot shows a GET API; I will only deploy a POST API, because the sequences can get too big.
1) Thanks for the explanation. Though I think it's just the same as what's added now, only with different statistics behind it. What you see in the listing now is really the data you would expect, maybe in a different order, but I still think it's ranked well.
But yes, I will probably make a new listing for found terms on a new form to show more stats on each term/phrase.
2) I would really prefer not to add Google APIs here.
- Still no chance to have the Google NLP API? Maybe optional for those who want it; I'm using it with another tool and it provides a lot of different keywords that we might miss with ngram/tf-idf
- I'm still having some issues when I search an English keyword for the US: sometimes the top 10 shows some local SERPs. Any solution?
- Is it possible to export all sites in HTML? At the moment only the personal report is HTML/Excel.
- An option to export only the keywords in HTML/Excel without opening the full research; the list is so long that it starts lagging and sometimes freezes.
- Can we select what we want in the full competitor research to speed up the process?
- Still no chance to have the Google NLP API? Maybe optional for those who want it; I'm using it with another tool and it provides a lot of different keywords that we might miss with ngram/tf-idf
What's that other app? I will try and have a look again.
- I'm still having some issues when I search an English keyword for the US: sometimes the top 10 shows some local SERPs. Any solution?
Please give a sample.
- Is it possible to export all sites in HTML? At the moment only the personal report is HTML/Excel.
Yes, but that would be a bit useless as there is nothing to compare against!?
- Can we select what we want in the full competitor research to speed up the process?
Yes, you can click on Configure and edit the ranking-factor filters, or right-click on a factor to filter it out.
- An option to export only the keywords in HTML/Excel without opening the full research; the list is so long that it starts lagging and sometimes freezes.
Where exactly is it loading slowly? The ngram on the full competitor research? You don't need that at all to get the ngram data. You can get it on the main form with "Add->Extract from Website->Search".
- SurferSEO; they usually show keywords from Google NLP besides their suggestions: http://prntscr.com/ufj5di
- At http://prntscr.com/ufj60a you can see some websites have the /it filter and some are just Italian websites
- Like you said, it's useful when you already have something to compare against, but when we research some keywords/topics we have nothing to compare; in that case we could at least see what is going on in the top 10
- Options -> Edit Filter works, but right-click -> Filter Highlighted and Remove doesn't remove anything on my end
- It lags and freezes (sometimes) when I scroll down to the bottom of the full competitor research. I just checked the ngram export from the main interface; it just shows the full keyword list without all the info we have inside the full competitor research, am I right?
- Is tf-idf for keywords, besides the ngram, still on the roadmap?
- Like you said, it's useful when you already have something to compare against, but when we research some keywords/topics we have nothing to compare
The next update will allow you to export it in HTML and use the HTML report template as a base.
- right-click -> Filter Highlighted and Remove doesn't remove anything on my end
What did you try to filter out? Some factors are dynamic.
- It lags and freezes (sometimes) when I scroll down to the bottom of the full competitor research. I just checked the ngram export from the main interface; it just shows the full keyword list without all the info we have inside the full competitor research, am I right?
Yes, it just shows you the extracted keywords, without details on them. Though I might add some more details if you need them.
- Is tf-idf for keywords, besides the ngram, still on the roadmap?
Well, tf-idf is just a different metric for how important an ngram-extracted keyword might be. I don't see how I can get more keywords extracted here.
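To make that distinction concrete, here is a small sketch of tf-idf as a weighting (not extraction) step over already-extracted ngrams, using scikit-learn with placeholder documents:

```python
# tf-idf only re-weights ngrams that are already extractable; it does not
# discover new keywords. Documents here are placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "keyword research tools for seo keyword analysis",
    "competitor research and serp analysis for seo",
]

vec = TfidfVectorizer(ngram_range=(1, 2))  # unigrams and bigrams
tfidf = vec.fit_transform(docs)

# Rank the ngrams of the first document by tf-idf weight.
scores = tfidf[0].toarray()[0]
ranked = sorted(zip(vec.get_feature_names_out(), scores),
                key=lambda p: p[1], reverse=True)
print(ranked[:5])
```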
Thanks @Sven
- Regarding the keyword export: all the data inside the full report is pretty useful, that's the main reason
- An option to filter keywords by word count (1, 2, 3, etc.)
- I tried to filter out the whole security section, the server line, and the HTTP version line; it doesn't remove anything (refreshed). Can we have a search function or something to find what we want inside the filter list? http://prntscr.com/ufldai
- Is tf-idf for keywords, besides the ngram, still on the roadmap?
Well, tf-idf is just a different metric for how important an ngram-extracted keyword might be. I don't see how I can get more keywords extracted here.
OK, one freebie for you guys: a simple yet powerful unsupervised algorithm for extracting keywords from texts is called TextRank (http://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf). It is not as powerful as a good neural network, but it is way more powerful than tf-idf wrangling. It applies graph-based ranking algorithms (explicitly, the good old PageRank) to natural-language texts. TextRank builds a graph that represents the text and interconnects words or other text entities with meaningful relations. This "textgraph" is then evaluated with the PageRank algorithm.
Sven is right on that one: TextRank will not extract "more" keywords, but it gives a better ranking among all the words and n-grams in the text with respect to their importance/relevance to the text. More keywords are not always better! You want the relevant keywords, not all keywords. If you want to extract all keywords, just extract all n-grams without weights. One method to get more "good/important/relevant" keywords that are not in the text itself is to look up the neighbors of the best TextRank keywords in a pretrained embedding matrix.
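A minimal sketch of the TextRank idea, assuming a co-occurrence window over raw tokens and networkx's PageRank (real implementations add part-of-speech filtering and n-gram merging):

```python
# Minimal TextRank-style sketch: co-occurrence graph + PageRank via networkx.
import networkx as nx

text = ("textrank builds a graph from the text and ranks words "
        "with pagerank so relevant words in the text rank high")
words = text.split()

# Connect words that co-occur within a small sliding window.
graph = nx.Graph()
window = 3
for i, w in enumerate(words):
    for other in words[i + 1:i + window]:
        if w != other:
            graph.add_edge(w, other)

# PageRank over the textgraph gives each word an importance score.
scores = nx.pagerank(graph)
for word, score in sorted(scores.items(), key=lambda p: -p[1])[:5]:
    print(f"{word}: {score:.3f}")
```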
If Sven can implement it, great. If you need an API, let me know; I have this in my repo already, and I can set it up for you after I've done the setup for the "Content-Categorizer". I can also set up a much more powerful keyword extractor based on deep learning if you guys are interested.
@TOPtActics I don't understand all the abbreviations, but it sounds really very interesting. We really have to push in this direction, and you seem to have studied the subject well. There also seems to be something to pick up from OpenAI, with Elon Musk's GPT-3(4).
Elon Musk is just an investor in OpenAI, like Microsoft is too. GPT-3 is closed, so we can't rebuild it, and it will be offered for ~$400/month. It's really cool what I see from the GPT-3 API playground (background: https://www.cnbc.com/2020/07/23/openai-gpt3-explainer.html). By now it's the best general-purpose model out there, but I think it's too expensive, and most tasks have to be "fine-tuned" on a specific problem set to work well.
@TOPtActics thanks for the ideas here. Though the extraction of phrases and keywords is already good in my eyes. The labeling of what might be more important than others is something to look at right now. This algorithm can help here as well.
@Sven is it possible at the moment to bulk-run quick competitor research? I have tried to select two or more keywords, but it always searches only the first keyword and opens the top-10 list. Or is it the same as using Tools -> Collect SEO Score? Manually searching one by one takes a few seconds, while using the tool even with two keywords takes a few minutes.
I find myself in a situation which could lead to a new option.
I have a list of keywords and I would like to be able to test their presence on a page of one of my sites.
It might be interesting to do a mass check of all the words so that you can keep only the words that are missing from the page.
I think the check could be done directly on the HTML version (which would therefore include all the tags, without distinction) and get back only the missing words.
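A hedged sketch of what such a mass check could look like; the URL and keyword list are placeholders:

```python
# Sketch of the requested mass check: fetch a page's raw HTML and report
# which keywords from a list are missing.
import requests

url = "https://example.com/some-page"
keywords = ["seo", "backlink", "silo", "semantic cocoon"]

html = requests.get(url, timeout=10).text.lower()

# Check the raw HTML (tags included, as suggested) and keep only the misses.
missing = [kw for kw in keywords if kw.lower() not in html]
print("missing keywords:", missing)
```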
It's great. Thanks for sharing these new OnPage SEO functions.