... I don't know how this could be approached; it would take semantic dictionaries and some sort of AI to do the job, like a search engine...
So you plug in two text corpora and calculate a similarity score between them, ranging from 0 to 1. This can be approached with NLP (Natural Language Processing), e.g. a BERT model (https://en.wikipedia.org/wiki/BERT_(language_model)) trained with a special loss function and some task-specific tweaks. If you need to get AI going for SEO purposes, I am happy to share ideas and team up to launch a service such as an API, for example on https://rapidapi.com/.
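As a rough illustration of such a corpus-to-corpus score, here is a minimal sketch assuming the sentence-transformers library and a generic pretrained model (the model name is a common public choice, not the custom-trained model proposed above):

```python
# Minimal sketch: similarity score between two corpora with a BERT-style
# encoder. The model is an off-the-shelf assumption for illustration.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

corpus_a = "Content of the first page ..."
corpus_b = "Content of the second page ..."

# Encode both corpora into fixed-size embeddings.
emb_a, emb_b = model.encode([corpus_a, corpus_b], convert_to_tensor=True)

# Cosine similarity lands roughly in [0, 1] for typical text pairs.
score = util.cos_sim(emb_a, emb_b).item()
print(f"similarity: {score:.3f}")
```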
If Sven is interested in lifting the captcha-breaker game to AI level, I could help too. It is a combination of CNN (Convolutional Neural Network) and RNN (Recurrent Neural Network) networks, set up behind it as the AI component. The user only needs to collect enough labeled samples (annotations), roughly 50-100 per captcha type, and press a button. A new net would automatically be trained on that captcha type with high accuracy, even on hard captchas. Sven wouldn't implement a Python interpreter in the GSA products, so I think the only way to go is a microservice approach via API.
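As a rough sketch of the CNN+RNN combination described here, assuming Keras, a fixed-size captcha image, and CTC-style per-timestep outputs (the actual architecture and training loop would certainly differ):

```python
# Sketch of a CNN+RNN captcha recognizer. Image size and alphabet are
# placeholder assumptions; real training would add CTC loss and augmentation.
from tensorflow.keras import layers, models

NUM_CHARS = 36          # a-z plus 0-9, an assumption
IMG_H, IMG_W = 50, 200  # example captcha dimensions

inputs = layers.Input(shape=(IMG_H, IMG_W, 1))

# CNN: extract visual features from the captcha image.
x = layers.Conv2D(32, 3, padding="same", activation="relu")(inputs)
x = layers.MaxPooling2D(2)(x)
x = layers.Conv2D(64, 3, padding="same", activation="relu")(x)
x = layers.MaxPooling2D(2)(x)

# Turn the width axis into a sequence of feature vectors.
_, h, w, c = x.shape
x = layers.Permute((2, 1, 3))(x)   # width becomes the time axis
x = layers.Reshape((w, h * c))(x)

# RNN: read the feature sequence left to right (and back).
x = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(x)

# One softmax per timestep; with CTC loss this decodes to the captcha text.
outputs = layers.Dense(NUM_CHARS + 1, activation="softmax")(x)  # +1 = CTC blank

model = models.Model(inputs, outputs)
model.summary()
```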
The idea seems nice, though I still need to understand how to achieve it.
I was talking about semantic dictionaries because it would have been possible to find (by correspondence) the dictionary that has the most similarity with a text, and therefore to know what it is talking about.
I do not know if this matches the initial request, but the benefits would be multiple and well suited to software like GSA Content Generator, for example.
For GSA Keyword Research, this could make it possible to build a semantic graph, which could be useful for creating a silo or for studying the tree structure of a site.
Semantic Cocoon:
Some interesting links:
https://gephi.org/
https://medialab.sciencespo.fr/en/tools/navicrawler/
Hi Kaine,
my suggestion would be to compute a similarity score from a BERT model between all corpora (a corpus being, for example, the content of a page), and to find out what each corpus is talking about, run topic modelling (https://en.wikipedia.org/wiki/Topic_model) on every corpus, too.
If there is interest in building such an API or an implementation, let me know.
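To illustrate the topic-modelling half of this suggestion, a small sketch with gensim's LDA; the pages, preprocessing, and topic count are placeholder assumptions:

```python
# Hedged sketch of topic modelling per corpus with gensim's LDA.
from gensim import corpora, models
from gensim.utils import simple_preprocess

pages = [
    "content of page one about link building and backlinks ...",
    "content of page two about keyword research and serps ...",
]

# Tokenize each page and build the dictionary / bag-of-words corpus.
tokens = [simple_preprocess(p) for p in pages]
dictionary = corpora.Dictionary(tokens)
bow = [dictionary.doc2bow(t) for t in tokens]

# Fit a small LDA model and print the discovered topics.
lda = models.LdaModel(bow, num_topics=2, id2word=dictionary, passes=10)
for topic_id, words in lda.print_topics():
    print(topic_id, words)
```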
Nice tool, this could be done too. BERT, like every neural network, has an n-dimensional representation (n can be chosen) in the middle of the network. These n dimensions can be reduced with dimensionality-reduction algorithms like t-SNE or PCA to get a 2-dimensional representation of an input, e.g. a word, for graphical analysis. The distance in this 2-D space, or in the n-D space, has a meaning: if you calculate Euclidean distance, Manhattan distance, or any L-norm in this space, you can interpret it as a distance in word "meaning". Similar words sit next to each other.
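A short sketch of that reduction step, assuming sentence-transformers embeddings and scikit-learn's t-SNE; the word list is a placeholder:

```python
# Project word embeddings to 2-D with t-SNE for a semantic plot/graph.
# The embedding model is an assumption; any encoder would do.
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from sentence_transformers import SentenceTransformer

words = ["seo", "backlink", "keyword", "recipe", "cooking", "oven"]
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(words)  # the n-D representation

# Reduce n dimensions to 2 for plotting; perplexity must be < number of points.
coords = TSNE(n_components=2, perplexity=3, random_state=0).fit_transform(embeddings)

plt.scatter(coords[:, 0], coords[:, 1])
for (x, y), w in zip(coords, words):
    plt.annotate(w, (x, y))  # similar words should land near each other
plt.show()
```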
Latest update will create reports in HTML with all kinds of data that your SEO customers might need/want.
It's not perfect but slowly getting there. Let me know what you think.
It still adds value but I don't think I would use it in the short term. It is rather the "Keywords/Content" aspect that attracts me to this software.
I take this opportunity to respond to @TOPtActics, since it all ties together: everything that relates to content, to understanding it, optimizing it, and even generating it, is the most vital area of present and future SEO, and every effort in this direction is worthwhile. The more the search engines develop their AI, the more flawless our content will have to be. Since I launched our indexing service, I have seen tons of different strategies, and I think I can tell you that the age of spun text should soon come to an end. Detection has gotten much better since the last update.
You will therefore have to adapt: write your own texts, or outsource this part to writers if something new does not come along quickly enough.
I coded a multi-categorizer with multi-language support some time ago; maybe this is interesting for you?
You put in any text you like, in any language you like (de, en, fr, pl, es, etc.), and you give it any categories you like (maybe your site categories), in any language you like, and the model will output, for each category, the probability that the text belongs to it.
Here is an example:
https://www.nytimes.com/2020/08/18/dining/black-jam-makers.html -- Food
I pasted the first paragraph in, pasted the original NYT categories in, and the model got it right.
Another example:
https://www.nytimes.com/2020/08/28/technology/microsoft-tiktok-lobbying.html -- Tech
Tech, Business, Politics & USA > 50% --> For me it seems the model got it right!
The special thing here is that normally you have to train a new neural net for every new label set; here I can put in whatever labels I like, and the model will do its best to find the text's similarity to each label (a category, in this example). This is totally new in the game! One model to serve all your categorization needs.
For example, I can just paste a random new category I think of into my category space, like "app", and the model will handle it:
It's a NN with a special kind of training, architecture, and cost function. The network puts each label into a sentence called the hypothesis_template, compares the distance between the input sentence and the hypothesis_template, and calculates the probabilities from that. The hypothesis_template is something like --> 'This text is about {}.', but it can be modified for better accuracy. The base hypothesis_template works pretty well out of the box.
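The mechanism described matches zero-shot classification with an NLI-style model; here is a minimal sketch with the Hugging Face pipeline, assuming a public model rather than the author's own:

```python
# Zero-shot categorization sketch: one NLI-style model, arbitrary labels.
# The model name is a common public choice, assumed for illustration.
from transformers import pipeline

classifier = pipeline("zero-shot-classification",
                      model="facebook/bart-large-mnli")

text = "Microsoft is lobbying to acquire TikTok's US operations."
labels = ["tech", "business", "politics", "food", "app"]

# Each label is inserted into the hypothesis template and scored against the text.
result = classifier(text, labels,
                    hypothesis_template="This text is about {}.",
                    multi_label=True)
for label, score in zip(result["labels"], result["scores"]):
    print(f"{label}: {score:.2f}")
```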
I'm now done coding a little test API (nothing for much load, no Redis etc.; maybe it can handle 200 requests/min). I'm on holiday next week; maybe I'll send you guys a link the week after. This screenshot shows a GET API; I will only deploy a POST API, because the sequences can get too big.
1) Thanks for the explanation. Though I think it's just the same as what's added now, only with different statistics behind it. What you see in the listing now is really the data you would expect, maybe in a different order, but I still think it's ranked well.
But yes, I will probably make a new listing for found terms on a new form to show more stats on each term/phrase.
2) I would really prefer not to add Google APIs here.
- Still no chance to have the Google NLP API? Maybe optional for those who want it; I'm using it with another tool and it provides a lot of different keywords that we might miss with ngram/tf-idf
- I'm still having some issues when I search an English keyword for the US: sometimes the top 10 shows some local SERPs. Any solution?
- Is it possible to export all sites in HTML? At the moment only the personal report is HTML/Excel.
- An option to export only the keywords in HTML/Excel without opening the full research; the list is so long that it starts lagging and sometimes freezes.
- Can we select what we want in the full competitor research to speed up the process?
- Still no chance to have the Google NLP API? Maybe optional for those who want it; I'm using it with another tool and it provides a lot of different keywords that we might miss with ngram/tf-idf
What's that other app? I will try and have a look again.
- I'm still having some issues when I search an English keyword for the US: sometimes the top 10 shows some local SERPs. Any solution?
Please give a sample.
- Is it possible to export all sites in HTML? At the moment only the personal report is HTML/Excel.
Yes, but that would be a bit useless as there is nothing to compare against!?
- Can we select what we want in the full competitor research to speed up the process?
Yes, you can click on Configure and edit the ranking-factor filters, or right-click on a factor to filter it out.
- An option to export only the keywords in HTML/Excel without opening the full research; the list is so long that it starts lagging and sometimes freezes.
Where exactly is it loading slowly? The ngram on the full competitor research? You don't need that at all to get the ngram data. You can get it on the main form with "Add->Extract from Website->Search".
- SurferSEO; they usually show keywords from Google NLP besides their suggestions: http://prntscr.com/ufj5di
- At http://prntscr.com/ufj60a you can see some websites have the /it filter and some are just Italian websites
- Like you said, it's useful when you already have something to compare against, but when we research some keywords/topics we have nothing to compare; in that case we could at least see what is going on in the top 10
- Options -> Edit Filter works, but right-click -> Filter Highlighted and Remove doesn't remove anything on my end
- It lags and freezes (sometimes) when I scroll down to the bottom of the full competitor research. I just checked the ngram export from the main interface; it just shows the full keyword list without all the info we have inside the full competitor research, am I right?
- Is tf-idf for keywords, besides the ngram, still on the roadmap?
- Like you said, it's useful when you already have something to compare against, but when we research some keywords/topics we have nothing to compare
The next update will allow you to export it in HTML and use the HTML report template as a base.
- right-click -> Filter Highlighted and Remove doesn't remove anything on my end
What did you try to filter out? Some factors are dynamic.
- It lags and freezes (sometimes) when I scroll down to the bottom of the full competitor research. I just checked the ngram export from the main interface; it just shows the full keyword list without all the info we have inside the full competitor research, am I right?
Yes, it just shows you the extracted keywords, without details on them. Though I might add some more details if you need them.
- Is tf-idf for keywords, besides the ngram, still on the roadmap?
Well, tf-idf is just a different metric for how important an ngram-extracted keyword might be. I don't see how I can get more keywords extracted here.
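To make that distinction concrete, here is a small sketch of tf-idf as a weighting (not extraction) step over already-extracted ngrams, using scikit-learn with placeholder documents:

```python
# tf-idf only re-weights ngrams that are already extractable; it does not
# discover new keywords. Documents here are placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "keyword research tools for seo keyword analysis",
    "competitor research and serp analysis for seo",
]

vec = TfidfVectorizer(ngram_range=(1, 2))  # unigrams and bigrams
tfidf = vec.fit_transform(docs)

# Rank the ngrams of the first document by tf-idf weight.
scores = tfidf[0].toarray()[0]
ranked = sorted(zip(vec.get_feature_names_out(), scores),
                key=lambda p: p[1], reverse=True)
print(ranked[:5])
```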
Thanks @Sven
- Regarding the keyword export: all the data inside the full report is pretty useful, that's the main reason
- An option to filter keywords by word count (1, 2, 3, etc.)
- I tried to filter out the whole security section, the server line, and the HTTP version line; it doesn't remove anything (refreshed). Can we have a search function or something to find what we want inside the filter list? http://prntscr.com/ufldai
- Is tf-idf for keywords, besides the ngram, still on the roadmap?
Well, tf-idf is just a different metric for how important an ngram-extracted keyword might be. I don't see how I can get more keywords extracted here.
OK, one freebie for you guys: a simple yet powerful unsupervised algorithm for extracting keywords from texts is called TextRank (http://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf). It is not as powerful as a good neural network, but it is way more powerful than tf-idf wrangling. It applies graph-based ranking algorithms (explicitly, the good old PageRank) to natural-language texts. TextRank builds a graph that represents the text and interconnects words or other text entities with meaningful relations. This "textgraph" is then evaluated with the PageRank algorithm.
Sven is right on that one: TextRank will not extract "more" keywords, but it gives a better ranking among all the words and n-grams in the text with respect to their importance/relevance to the text. More keywords are not always better! You want the relevant keywords, not all keywords. If you want to extract all keywords, just extract all n-grams without weights. One method to get more "good/important/relevant" keywords that are not in the text itself is to look up the neighbors of the best TextRank keywords in a pretrained embedding matrix.
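A minimal sketch of the TextRank idea, assuming a co-occurrence window over raw tokens and networkx's PageRank (real implementations add part-of-speech filtering and n-gram merging):

```python
# Minimal TextRank-style sketch: co-occurrence graph + PageRank via networkx.
import networkx as nx

text = ("textrank builds a graph from the text and ranks words "
        "with pagerank so relevant words in the text rank high")
words = text.split()

# Connect words that co-occur within a small sliding window.
graph = nx.Graph()
window = 3
for i, w in enumerate(words):
    for other in words[i + 1:i + window]:
        if w != other:
            graph.add_edge(w, other)

# PageRank over the textgraph gives each word an importance score.
scores = nx.pagerank(graph)
for word, score in sorted(scores.items(), key=lambda p: -p[1])[:5]:
    print(f"{word}: {score:.3f}")
```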
If Sven can implement it, great. If you need an API, let me know; I have this in my repo already, and I can set it up for you after I've done the setup for the "Content-Categorizer". I can also set up a much more powerful keyword extractor based on deep learning if you guys are interested.
@TOPtActics I don't understand all the abbreviations, but it sounds really very interesting. We really have to push in this direction, and you seem to have studied the subject well. There also seems to be something to pick up from OpenAI, with Elon Musk's GPT-3(4).
Elon Musk is just an investor in OpenAI, like Microsoft is too. GPT-3 is closed, so we can't rebuild it, and it will be offered for ~$400/month. It's really cool what I see from the GPT-3 API playground (background: https://www.cnbc.com/2020/07/23/openai-gpt3-explainer.html). By now it's the best general-purpose model out there, but I think it's too expensive, and most tasks have to be "fine-tuned" on a specific problem set to work well.
@TOPtActics thanks for the ideas here. Though the extraction of phrases and keywords is already good in my eyes. The labeling of what might be more important than others is something to look at right now. This algorithm can help here as well.
@Sven is it possible at the moment to bulk-run quick competitor research? I have tried to select two or more keywords, but it always searches only the first keyword and opens the top-10 list. Or is it the same as using Tools -> Collect SEO Score? Manually searching one by one takes a few seconds, while using the tool even with two keywords takes a few minutes.
I find myself in a situation which could lead to a new option.
I have a list of keywords and I would like to be able to test their presence on a page of one of my sites.
It might be interesting to do a mass check of all the words so that you can keep only the words that are missing from the page.
I think the check could be done directly on the HTML version (which would therefore include all the tags, without distinction) and get back only the missing words.
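A hedged sketch of what such a mass check could look like; the URL and keyword list are placeholders:

```python
# Sketch of the requested mass check: fetch a page's raw HTML and report
# which keywords from a list are missing.
import requests

url = "https://example.com/some-page"
keywords = ["seo", "backlink", "silo", "semantic cocoon"]

html = requests.get(url, timeout=10).text.lower()

# Check the raw HTML (tags included, as suggested) and keep only the misses.
missing = [kw for kw in keywords if kw.lower() not in html]
print("missing keywords:", missing)
```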
It's great. Thanks for sharing these new OnPage SEO functions.