
Does/can/how does GSA grab all commentable URLs from a blog-comment-based platform?

If I have one or more URLs on a blog-comment-based platform that allows comment links to be dropped, and/or just the root domain, does GSA already have a function, or could one be implemented, that grabs all the other internal URLs on that domain that are also open to comments?

Or do I have to check all the submitted/verified URLs, sort by the blog comment related engines, and then run a site: search or link extractor in Scrapebox, or whatever, to grab them all myself?

And before the Karen newbies chime in, I'm not worried about too many links or diminishing returns from the same domain, blah blah blah.

Thanks

Comments

  • cherub SERnuke.com
    For blog comments, once I have a decent-sized verified list, I extract all the domains and then scrape site:domain.com for each of them, adding in some stopwords to try and get around the limitations Google puts on those sorts of searches. This usually gives me a load of other blog comment URLs from those domains.
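A minimal sketch of the workflow cherub describes above, written in Python under a few assumptions: the verified URLs have been exported to a plain text file (one URL per line, here called verified_urls.txt), and the stopword list is just a small illustrative example rather than anything taken from GSA SER.

```python
from urllib.parse import urlparse

# Example stopwords only; swap in simple words that fit your language/niche.
STOPWORDS = ["the", "and", "for", "with", "about"]

def unique_domains(path):
    """Collect the unique hostnames from a file of verified URLs (one per line)."""
    domains = set()
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            host = urlparse(line.strip()).netloc.lower()
            if host.startswith("www."):
                host = host[4:]
            if host:
                domains.add(host)
    return sorted(domains)

def build_queries(domains):
    """Pair each domain with a plain site: search plus stopword variations."""
    queries = []
    for domain in domains:
        queries.append(f"site:{domain}")
        queries.extend(f"site:{domain} {word}" for word in STOPWORDS)
    return queries

if __name__ == "__main__":
    # "verified_urls.txt" is a hypothetical export of verified blog comment URLs.
    for query in build_queries(unique_domains("verified_urls.txt")):
        print(query)
```

The printed queries can then be fed to whichever scraper you already use; the stopword variants only exist to coax different result pages out of the same site: search.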
  • Sven www.GSA-Online.de
    There is a script command for doing this per engine, but for blog comments it is a general engine that would not work for specific ones. I need to look into this then.
  • googlealchemist Anywhere I want
    cherub said:
    For blog comments, once I have a decent-sized verified list, I extract all the domains and then scrape site:domain.com for each of them, adding in some stopwords to try and get around the limitations Google puts on those sorts of searches. This usually gives me a load of other blog comment URLs from those domains.
    thanks, that's what I figured my plan would be as well.
    What do you mean about the stop words to get around Google limitations?
  • googlealchemistgooglealchemist Anywhere I want
    Sven said:
    There is a script command for doing this per engine, but for blog comments it is a general engine that would not work for specific ones. I need to look into this then.
    That'd be awesome if you could automate this in some way...

    Not sure what you mean about the specific vs general engines though; I see about 20 different specific types of blog comments under the blog comment section?
  • cherub SERnuke.com
    googlealchemist said:
    cherub said:
    For blog comments, once I have a decent-sized verified list, I extract all the domains and then scrape site:domain.com for each of them, adding in some stopwords to try and get around the limitations Google puts on those sorts of searches. This usually gives me a load of other blog comment URLs from those domains.
    thanks, that's what I figured my plan would be as well.
    What do you mean about the stop words to get around Google limitations?
    Google will only display around 300-400 results max for any search, no matter if they say they have thousands of results available. Adding simple keywords or stopwords, negative keywords, etc. will usually bring back a slightly different set of results with possibly new URLs not previously given.
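To illustrate the point about the result cap, here is a small sketch (same assumptions as above: plain Python, purely illustrative data, nothing built into SER) of merging the results of several variations of one site: query and counting how many new URLs each variation contributes.

```python
def merge_result_sets(results_by_query):
    """Merge per-query result lists, reporting how many new URLs each variant adds."""
    seen = set()
    for query, urls in results_by_query.items():
        fresh = [url for url in urls if url not in seen]
        seen.update(fresh)
        print(f"{query}: {len(urls)} returned, {len(fresh)} new")
    return sorted(seen)

if __name__ == "__main__":
    # Purely illustrative result sets; in practice these come from your scraper.
    example = {
        "site:example.com": ["https://example.com/post-1", "https://example.com/post-2"],
        "site:example.com the": ["https://example.com/post-2", "https://example.com/post-3"],
        "site:example.com -category": ["https://example.com/post-3", "https://example.com/post-4"],
    }
    merged = merge_result_sets(example)
    print(f"total unique URLs: {len(merged)}")
```

Because each capped result set only partially overlaps with the others, the merged, de-duplicated list keeps growing as more variations are tried.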
  • edited July 18
    I do the same thing when scraping to "parse deeper". You can do this by setting "add stop words" 15 percent of the time in SER options, for example. Just don't use the example list as is. That list covers many languages and is made for everyone, and the footprints will likely get blocked quicker when you use odd characters on Google.com, for example, where Polish or Arabic or other patterns wouldn't really be searched by a human. Get a small set of very simple words in your own language that make sense for the search engine you are scraping.



    This is how you can do it in SER.

    It does work and brings back results you would otherwise not have gotten.

    You can use this list for English; it's not the one I use, just an older example I had lying around...

    https://pastebin.com/34SvzRk9
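For anyone scripting their own scraping outside of SER, here is a rough sketch of the behaviour described in the comment above: appending a simple stopword to a query a fixed percentage of the time. The 15% chance, the word list, and the example footprint are placeholders, not SER's own settings or the poster's list.

```python
import random

# A handful of simple English stopwords; the pastebin linked above is the
# poster's own, larger example list.
SIMPLE_STOPWORDS = ["the", "and", "that", "with", "this", "from"]

def maybe_add_stopword(query, chance=0.15, rng=random):
    """Append a random stopword to the query roughly `chance` of the time."""
    if rng.random() < chance:
        return f"{query} {rng.choice(SIMPLE_STOPWORDS)}"
    return query

if __name__ == "__main__":
    base = 'site:example.com "leave a comment"'
    for _ in range(10):
        print(maybe_add_stopword(base))
```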
  • googlealchemist Anywhere I want
    edited July 25
    cherub said:
    Google will only display around 300-400 results max for any search, no matter if they say they have thousands of results available. Adding simple keywords or stopwords, negative keywords, etc. will usually bring back a slightly different set of results with possibly new URLs not previously given.
    I never realized they throttled site: searches for grabbing all the inner URLs; I assumed that as long as I had enough proxies I'd avoid any issues there. Thanks for the tip!