I want to scrape and save URLs that have a certain folder...possible?

davidle

So, I want to collect a list of all domains on the Internet that have a certain folder.

For example: domain.com/gaming/

...If a domain has a subfolder of "gaming" (that's just an example, not my actual keyword), I'd like to save that domain/link in a text file or spreadsheet. And I'd like to check every site on the Internet (as much as that's possible).

Is this remotely possible? Are there any apps I should check out? I know some programming basics, but would like to keep things fairly simple and use existing tools.
 


First of all, you need to get a list of all the URLs on the internet containing that keyword. I don't know of anyone with a more extensive database than companies like Ahrefs, Majestic SEO, and Moz; you would need to negotiate a deal with them. Scraping them is the easier part.
 
First of all, you need to get a list of all the URLs on the internet containing that keyword.

This is actually what I want to collect. I don't want the data or anything from the websites themselves (scraping is the wrong word). I just want a nice, organized list of the URLs that have a directory matching my keyword. Something like:

domain1.com/mykeyword
domain2.com/mykeyword
domain9999.com/mykeyword
etc.

Is that even possible? Or is that something that has to be done manually? (Which would be a pain in the ass.)
 
My first thought would be a Scrapebox footprint with the inurl:/keyword/ operator, then remove duplicate domains from the output. Should work.
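
If you end up cleaning the harvested list outside Scrapebox, the dedupe step is easy to reproduce yourself. Here's a minimal Python sketch, assuming the harvested URLs sit one per line in a file called urls.txt (hypothetical name); it keeps one URL per domain, roughly what Scrapebox's "Remove Duplicate Domains" option does:

from urllib.parse import urlparse

def dedupe_domains(urls):
    # Keep the first URL seen for each domain, ignoring any leading "www."
    seen, unique = set(), []
    for url in urls:
        if "://" not in url:          # urlparse needs a scheme to find the host
            url = "http://" + url
        domain = urlparse(url).netloc.lower()
        if domain.startswith("www."):
            domain = domain[4:]
        if domain and domain not in seen:
            seen.add(domain)
            unique.append(url)
    return unique

with open("urls.txt") as f:
    harvested = [line.strip() for line in f if line.strip()]

with open("unique_domains.txt", "w") as f:
    f.write("\n".join(dedupe_domains(harvested)))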
 
Actually no, it's this one that would work: site:*.com/gaming/

You could get a nice list from that very easily with Scrapebox. Then you could switch out the .com for .org or whatever else you need.
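
Generating those footprint variants per TLD is trivial to script, too. A quick Python sketch (the TLD list is just an example):

keyword = "gaming"                  # the example directory from this thread
tlds = ["com", "org", "net", "co"]  # swap in whatever TLDs you need
footprints = [f"site:*.{tld}/{keyword}/" for tld in tlds]
print("\n".join(footprints))        # paste these into Scrapebox as footprints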
 
Actually no, it's this one that would work: site:*.com/gaming/

You could get a nice list from that very easily with Scrapebox. Then you could switch out the .com for .org or whatever else you need.

Sounds perfect. Now, if only I had Windows! It might be worth getting Windows/Scrapebox if Scrapebox does this well. Any idea how big the list might be? Thousands (assuming thousands of sites have the directory I'm looking for)?

I was looking at import.io and Blockspring to see if they can do anything like this.

There's a Chrome extension called Linkclump that lets you mass-copy links you highlight. So if you set your search results to return 100 per page instead of 10 and search "insiteurl:keyword", I can come close. But it's obviously manual and not perfect, since insiteurl doesn't show me only directories that match my search term.
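
Even with that manual approach, you can filter the copied links afterwards so only real directory matches survive. A small Python sketch, assuming the Linkclump output was pasted into a file called copied_links.txt (hypothetical name):

from urllib.parse import urlparse

def has_keyword_dir(url, keyword="gaming"):
    # True only if /keyword/ appears as an actual path segment,
    # not just anywhere in the URL
    if "://" not in url:
        url = "http://" + url
    segments = [s for s in urlparse(url).path.split("/") if s]
    return keyword in segments

with open("copied_links.txt") as f:
    for link in (line.strip() for line in f if line.strip()):
        if has_keyword_dir(link):
            print(link)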
 
You could run Scrapebox on a VPS, which is nice because it doesn't tie up your local resources. You'll need proxies; the faster you want to go, the more proxies you'll end up consuming. I don't know what Mac guys use instead of Scrapebox; I remember some Mac people were running Scrapebox in VirtualBox.

As far as how big the list will be, that depends on your keyword; Scrapebox will keep scraping until it either runs out of results from the search engine or you tell it to stop.
 
You could run Scrapebox on a VPS, which is nice because it doesn't tie up your local resources. You'll need proxies; the faster you want to go, the more proxies you'll end up consuming. I don't know what Mac guys use instead of Scrapebox; I remember some Mac people were running Scrapebox in VirtualBox.

As far as how big the list will be, that depends on your keyword; Scrapebox will keep scraping until it either runs out of results from the search engine or you tell it to stop.

Thanks. I see that Scrapebox can be run on a VPS. That would be ideal. Does anybody have any idea how good the VPS that Scrapebox recommends on their site is?
 
Install Scrapebox on Amazon AWS. You'll have free use for one year. I can assure you, Amazon will perform way better than most available options and for the price, you can't beat it.

Worked perfectly. Thanks to you and BabyGotBacklink. I set up an Amazon EC2 instance, used the exact footprint BabyGotBacklink recommended, and got back a shitload of URLs. Now I'm wondering what else I can use AWS/Scrapebox for. So far I've been able to scrape the links for phone numbers and emails, so that may or may not come in handy. I'm also harvesting the same footprint/keywords for .co/.net TLDs to get even more URLs.
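
For anyone curious what the email-grabbing part looks like outside Scrapebox, here's a rough Python sketch of the same idea (a naive regex grab over each page's HTML; nowhere near as polished as Scrapebox's built-in grabber, and the input filename carries over from the dedupe sketch above):

import re
import urllib.request

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def extract_emails(url):
    # Fetch one page and return any email-looking strings found in its HTML
    try:
        html = urllib.request.urlopen(url, timeout=10).read().decode("utf-8", "ignore")
    except Exception:
        return set()
    return set(EMAIL_RE.findall(html))

with open("unique_domains.txt") as f:   # URLs already include http:// from the dedupe step
    for url in (line.strip() for line in f if line.strip()):
        for email in sorted(extract_emails(url)):
            print(url, email, sep="\t")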