WickedFire - Affiliate Marketing Forum - Internet Marketing Webmaster SEO Forum

Old 01-21-2012, 07:27 PM   #1 (permalink)
Senior Member
 
Join Date: Nov 2011
Posts: 104
iTrader: 0 / 0%
genetic has a reputation beyond repute
Testing Some Code - Free Data Scraping For All

I am coding a scraping API. It's for a service that I am going to launch. Development has been ongoing for 3 days so far.

Although the API front-end is not started yet, the back-end (processing) is in a late Alpha stage now, and is capable of handling scrapes for the source of any URL that doesn't require a POST. In simple terms it cannot scrape data from a page that requires form input (yet).

I would like to test what I have implemented so far with real scraping requests, as I'm not sure if I am going too easy on myself with the tests I have come up with so far.

If anybody has any sites that they would like some data scraped from, I can do this for you (FREE!) as part of my testing.

The details I require are as follows -

Starting URL - The URL of "Page 1" of the scrape; subsequent URLs will be calculated by the script. If the URL needs variable parameters passed to it (e.g. example.com?example=parameter), include the list of parameters that need to be used.

Required Data - Some sort of identifier for the data that should be scraped. For example, if the data required is an email address and it is labelled "E-Mail Addy" on the target website, then "E-Mail Addy" would be the required data. Of course, it is fine to have multiple pieces of required data.
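
For anyone wondering what a request like that might look like in practice, here is a rough sketch in Python (the service itself is PHP, so this is only an illustration). The `job` dictionary, the `build_page_urls` helper, and the `page` query parameter are all made-up names, and it assumes the target site paginates through a query-string parameter, which will not be true everywhere.

```python
from urllib.parse import urlencode, urlsplit, urlunsplit, parse_qsl


def build_page_urls(start_url, params=None, page_param="page", pages=3):
    """Return one URL per page, merging extra params into the query string."""
    parts = urlsplit(start_url)
    query = dict(parse_qsl(parts.query))
    query.update(params or {})
    urls = []
    for page in range(1, pages + 1):
        # Assumption: the site exposes pagination as ?page=N
        query[page_param] = str(page)
        urls.append(urlunsplit(parts._replace(query=urlencode(query))))
    return urls


# Hypothetical scrape request: starting URL, extra parameters, required data labels.
job = {
    "start_url": "http://example.com/listing?example=parameter",
    "params": {"sort": "name"},
    "required_data": ["E-Mail Addy"],
}
page_urls = build_page_urls(job["start_url"], job["params"], pages=3)
```

Each entry in `page_urls` would then be fetched and searched for the labels listed under `required_data`.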

One thing I will mention is that this is not an unlimited offer, in that I am not necessarily offering it to unlimited people, nor am I offering to scrape an unlimited amount of data. I think a reasonable number of URLs to scrape is 500. Remember this is a fully hosted service.

Any questions, or requests for scrapes, just reply in this thread. If anything you want to scrape is non-public, I'm sure you know where the PM button is!
genetic is offline   Reply With Quote
Old 01-21-2012, 07:36 PM   #2 (permalink)
Senior Botter
 
dchuk's Avatar
 
Join Date: Oct 2008
Location: San Diego,CA
Posts: 5,466
iTrader: 35 / 100%
dchuk has a reputation beyond repute
what'd ya build it with?
dchuk is offline   Reply With Quote
Old 01-21-2012, 07:40 PM   #3 (permalink)
Automation Specialist
 
Bofu2U's Avatar
 
Join Date: May 2007
Location: Baltimore, Murdaland
Posts: 8,291
iTrader: 78 / 100%
Bofu2U has a reputation beyond repute
So wait, is this like a .. I give you a URL, and it scrapes the entire site or... what?
Bofu2U is online now   Reply With Quote
Old 01-21-2012, 07:40 PM   #4 (permalink)
Senior Member
 
xpathfucker's Avatar
 
Join Date: Jun 2011
Posts: 229
iTrader: 0 / 0%
xpathfucker has a reputation beyond repute
I love scraping. It's like licking your cousin's tits. You know it's wrong but it really tastes so sweet... Just can't help myself.

Anyway, can you scrape images too or just simple text?
xpathfucker is offline   Reply With Quote
Old 01-21-2012, 07:41 PM   #5 (permalink)
Senior Member
 
Join Date: Sep 2010
Posts: 323
iTrader: 17 / 100%
Jake232 has a reputation beyond repute
Quote:
Originally Posted by dchuk View Post
what'd ya build it with?
Probably Anarchid
Jake232 is offline   Reply With Quote
Old 01-21-2012, 07:44 PM   #6 (permalink)
Senior Member
 
Join Date: Nov 2011
Posts: 104
iTrader: 0 / 0%
genetic has a reputation beyond repute
It's all in PHP. I'm sure many will feel that it isn't the best language to do something like this in, but it's what I know best, and (so far) it seems to be working well.

You got any sites for me to scrape dchuk? :P
genetic is offline   Reply With Quote
Old 01-21-2012, 07:51 PM   #7 (permalink)
Senior Botter
 
dchuk's Avatar
 
Join Date: Oct 2008
Location: San Diego,CA
Posts: 5,466
iTrader: 35 / 100%
dchuk has a reputation beyond repute
Quote:
Originally Posted by genetic View Post
It's all in PHP. I'm sure many will feel that it isn't the best language to do something like this in, but it's what I know best, and (so far) it seems to be working well.

You got any sites for me to scrape dchuk? :P
nah, I'm good, I have my own scrapers when I need them (not knocking what you've done, just leaving a spot open for someone else who needs scraping done)

I wrote this in ruby, might want to play with it, seems like it would be useful for what you're doing: http://github.com/dchuk/Arachnid
dchuk is offline   Reply With Quote
Old 01-21-2012, 07:54 PM   #8 (permalink)
Senior Member
 
Join Date: Nov 2011
Posts: 104
iTrader: 0 / 0%
genetic has a reputation beyond repute
Several replies since I replied to dchuk.

@Bofu2U This is a lot more "logical" than that. For example, say you want all the "First Name" entries from a site. You give me the URL of page one of the listing, and the script grabs every "First Name" entry throughout the listing, including entries on any paginated pages. More complicated scrapes are possible, but this is the basic idea.
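
To make the "grab every First Name entry" idea concrete, here is a minimal sketch in Python using only the standard library. It is not the actual script from this thread (that one is PHP); `LabelScraper` and `scrape_label` are hypothetical names, and it assumes the value is simply the next piece of text after the label, which real sites will not always honour.

```python
from html.parser import HTMLParser


class LabelScraper(HTMLParser):
    """Collect the text node that follows each occurrence of a label."""

    def __init__(self, label):
        super().__init__()
        self.label = label
        self.values = []
        self._expect_value = False

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self._expect_value:
            # The previous text node was the label, so this one is the value.
            self.values.append(text)
            self._expect_value = False
        elif text.rstrip(":") == self.label:
            self._expect_value = True


def scrape_label(pages, label):
    """Run the scraper over every page's HTML and gather all values."""
    found = []
    for html in pages:
        parser = LabelScraper(label)
        parser.feed(html)
        parser.close()
        found.extend(parser.values)
    return found


# Two made-up pages standing in for a paginated listing.
pages = [
    "<table><tr><td>First Name</td><td>Alice</td></tr>"
    "<tr><td>First Name</td><td>Bob</td></tr></table>",
    "<table><tr><td>First Name:</td><td>Carol</td></tr></table>",
]
names = scrape_label(pages, "First Name")
```

Here `names` ends up holding the values from both pages in document order.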

@xpathfucker "it's wrong but it really tastes so sweet..." <-- THIS. The script can scrape text, but if it is scraping images it can upload them to a specified FTP server or return their original URLs.
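
The "return their original URLs" half of that could be sketched roughly like this in Python. `ImageCollector` is a hypothetical name, the sample markup is invented, and the FTP upload step is left out entirely.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageCollector(HTMLParser):
    """Collect absolute URLs for every <img> tag that has a src attribute."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.image_urls = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                # Resolve relative paths against the page the image was found on.
                self.image_urls.append(urljoin(self.base_url, src))


collector = ImageCollector("http://example.com/gallery/page1.html")
collector.feed('<img src="a.jpg"><img src="/img/b.png"><img alt="no source">')
image_urls = collector.image_urls
```

Resolving against the page URL matters because many sites use relative `src` values, which are useless once they leave the page.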

@Jake232 Eh?
genetic is offline   Reply With Quote
Old 01-21-2012, 07:57 PM   #9 (permalink)
Automation Specialist
 
Bofu2U's Avatar
 
Join Date: May 2007
Location: Baltimore, Murdaland
Posts: 8,291
iTrader: 78 / 100%
Bofu2U has a reputation beyond repute
Quote:
Originally Posted by genetic View Post
Several replies since I replied to dchuk.

@Bofu2U This is a lot more "logical" than that. For example, say you want all the "First Name" entries from a site. You give me the URL of page one of the listing, and the script grabs every "First Name" entry throughout the listing, including entries on any paginated pages. More complicated scrapes are possible, but this is the basic idea.
What if I give you one page with an example of the structure (and what I want, like First name) and tell you I want that information from every page on the entire site?
Bofu2U is online now   Reply With Quote
Old 01-21-2012, 08:00 PM   #10 (permalink)
Senior Member
 
Join Date: Nov 2011
Posts: 104
iTrader: 0 / 0%
genetic has a reputation beyond repute
Ahh, it seems that Jake232 was talking about Arachnid by dchuk.

@dchuk Yeah, I appreciate that a lot of people will have their own solutions for any scraping requirements that they may have, but at the same time there are many who do not have access to this kind of resource. This is the gap(?) in the market that I am looking to fill with this service. Thanks for sharing the code for Arachnid, I will have to have a look, and at the very least understand the approach you have taken. Hopefully I can learn something from your methods!
genetic is offline   Reply With Quote
Old 01-21-2012, 08:02 PM   #11 (permalink)
Senior Member
 
xpathfucker's Avatar
 
Join Date: Jun 2011
Posts: 229
iTrader: 0 / 0%
xpathfucker has a reputation beyond repute
Quote:
Originally Posted by genetic View Post
Several replies since I replied to dchuk.
The script can scrape text, but if it is scraping images it can upload them to a specified FTP server or return their original URLs.
Ok cool. May I suggest you include a way to scrape stuff that requires POST requests (which is quite easy anyway), 'cause that is where some powerful things can be made with PHP.
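
As a rough idea of how little is involved in a form POST, here is a sketch in Python rather than PHP. `build_post_request`, the URL, and the field names are all made up; it only builds the request and never sends it.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_post_request(url, fields):
    """Build a form-encoded POST request without sending it."""
    body = urlencode(fields).encode("utf-8")
    return Request(
        url,
        data=body,  # supplying a body is what makes urllib treat this as a POST
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


# Hypothetical search form with two inputs.
request = build_post_request(
    "http://example.com/search", {"first_name": "Alice", "page": 1}
)
# Passing `request` to urllib.request.urlopen would actually send it; omitted here.
```

The hard part of scraping behind forms is usually working out which fields, cookies, and hidden tokens the site expects, not the request itself.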

No urls to give u, but u sure have a nice offer. Have fun scraping.
xpathfucker is offline   Reply With Quote
Old 01-21-2012, 08:05 PM   #12 (permalink)
Senior Member
 
Join Date: Nov 2011
Posts: 104
iTrader: 0 / 0%
genetic has a reputation beyond repute
Quote:
Originally Posted by Bofu2U View Post
What if I give you one page with an example of the structure (and what I want, like First name) and tell you I want that information from every page on the entire site?
This is exactly the objective I am working to achieve. Nice and automated, define the URL to scrape, and then grab the required data from the entire site so reliably that it can be handled by other scripts or even (spun? [and]) used as content.
genetic is offline   Reply With Quote
Old 01-21-2012, 09:09 PM   #13 (permalink)
Automation Specialist
 
Bofu2U's Avatar
 
Join Date: May 2007
Location: Baltimore, Murdaland
Posts: 8,291
iTrader: 78 / 100%
Bofu2U has a reputation beyond repute
Quote:
Originally Posted by genetic View Post
This is exactly the objective I am working to achieve. Nice and automated, define the URL to scrape, and then grab the required data from the entire site so reliably that it can be handled by other scripts or even (spun? [and]) used as content.
That works. If you want a nice stress test and have plenty of proxies let me know.
Bofu2U is online now   Reply With Quote
Old 01-21-2012, 09:23 PM   #14 (permalink)
wut
 
eliquid's Avatar
 
Join Date: May 2007
Posts: 4,845
iTrader: 61 / 100%
eliquid has a reputation beyond repute
if you're using PHP, are you using cURL, raw sockets, or fopen?

if cURL, are you using an async/non-blocking version of it? how about polling/a queue?

just curious, I've been testing a non-blocking multi-cURL script that has a queue....
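
For readers who have not seen the non-blocking-plus-queue pattern, here is a small sketch of the general shape using Python's `asyncio` instead of PHP's multi-cURL, so it is an analogy and not the script mentioned above. `fake_fetch` is a stand-in that never touches the network, and `worker`, `crawl`, and the URLs are hypothetical.

```python
import asyncio


async def fake_fetch(url):
    """Stand-in for a non-blocking HTTP request; no network is touched."""
    await asyncio.sleep(0)
    return f"<html>{url}</html>"


async def worker(queue, results):
    """Pull URLs off the queue until cancelled."""
    while True:
        url = await queue.get()
        try:
            results[url] = await fake_fetch(url)
        finally:
            queue.task_done()


async def crawl(urls, concurrency=3):
    queue = asyncio.Queue()
    for url in urls:
        queue.put_nowait(url)
    results = {}
    workers = [
        asyncio.create_task(worker(queue, results)) for _ in range(concurrency)
    ]
    await queue.join()  # wait until every queued URL has been processed
    for task in workers:
        task.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return results


urls = [f"http://example.com/page/{n}" for n in range(1, 6)]
results = asyncio.run(crawl(urls))
```

The `concurrency` value caps how many requests are in flight at once, which is the knob that usually matters when a target site starts throttling.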
__________________
.
.
Spam and Quality Link Diversity Done Right
eliquid is offline   Reply With Quote
Old 01-22-2012, 12:43 AM   #15 (permalink)
Microwaving Toast
 
Join Date: Apr 2009
Location: random_location()
Posts: 2,333
iTrader: 79 / 99%
mattseh has a reputation beyond repute
Man, I hate when people name drop their own open source scrapers.





















https://github.com/mattseh/python-web
__________________
You have a right to obey.
mattseh is offline   Reply With Quote
Old 01-22-2012, 04:50 AM   #16 (permalink)
Beach Bum
 
Berto's Avatar
 
Join Date: Jan 2009
Location: Proudly Incorporated in WY
Posts: 2,679
iTrader: 54 / 100%
Berto has a reputation beyond repute
DO WANT. Will buzz you whenever I sober up - Thanks!
__________________
Always down to buy, sell, and trade health-related links. PM Me.
I use SerpIQ and Micro Site Masters Rank Tracker
Berto is offline   Reply With Quote
Old 01-23-2012, 10:12 AM   #17 (permalink)
Senior Member
 
Join Date: Nov 2011
Posts: 104
iTrader: 0 / 0%
genetic has a reputation beyond repute
Thanks for the PMs, had an interesting eBay scraping request that has made me rethink the way that I am doing things. Basically, the system now caches pages that are being scraped. If anybody else has any scraping requests, then please send me details and I will have a look!
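
A page cache along those lines could look something like the following Python sketch. This is only a guess at the general idea and not the code running behind the service; `PageCache` and `fake_fetch` are hypothetical names, and the cache never expires anything.

```python
import hashlib
import tempfile
from pathlib import Path


class PageCache:
    """Store fetched pages on disk so repeat scrapes skip the network."""

    def __init__(self, directory, fetch):
        self.directory = Path(directory)
        self.directory.mkdir(parents=True, exist_ok=True)
        self.fetch = fetch

    def _path(self, url):
        # Hash the URL so any characters in it are safe as a file name.
        digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
        return self.directory / (digest + ".html")

    def get(self, url):
        path = self._path(url)
        if path.exists():
            return path.read_text(encoding="utf-8")
        html = self.fetch(url)
        path.write_text(html, encoding="utf-8")
        return html


calls = []


def fake_fetch(url):
    """Stand-in fetcher that records each call instead of hitting a site."""
    calls.append(url)
    return "<html>listing for " + url + "</html>"


cache = PageCache(tempfile.mkdtemp(), fake_fetch)
first = cache.get("http://example.com/item/1")
second = cache.get("http://example.com/item/1")
```

The second `get` is served from disk, so `fake_fetch` only runs once for that URL.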

@Bofu2U Sweet. I haven't started working on proxy support yet, but when I'm there I'll hit you up.

@eliquid I am using file_get_contents() for simple requests and cURL for requests where I need to send header information. I will have a look at asynchronous non-blocking requests. The system does have a queue, and can be set to process a defined number of jobs per minute.
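
A queue that processes a defined number of jobs per minute can be sketched like this in Python. Again, this is an illustration under assumptions and not the PHP queue described above; `RateLimitedQueue` and `FakeClock` are made-up names, and the fake clock is only there so the example runs instantly.

```python
import time
from collections import deque


class RateLimitedQueue:
    """FIFO job queue that releases at most `per_minute` jobs per 60 seconds."""

    def __init__(self, per_minute, clock=time.monotonic, sleep=time.sleep):
        self.interval = 60.0 / per_minute
        self.clock = clock
        self.sleep = sleep
        self.jobs = deque()

    def add(self, job):
        self.jobs.append(job)

    def run(self, handler):
        results = []
        next_slot = self.clock()
        while self.jobs:
            wait = next_slot - self.clock()
            if wait > 0:
                self.sleep(wait)  # hold the job until its time slot opens
            results.append(handler(self.jobs.popleft()))
            next_slot += self.interval
        return results


class FakeClock:
    """Test clock: sleeping just advances the time, so nothing really waits."""

    def __init__(self):
        self.now = 0.0
        self.slept = []

    def time(self):
        return self.now

    def sleep(self, seconds):
        self.slept.append(seconds)
        self.now += seconds


fake = FakeClock()
queue = RateLimitedQueue(per_minute=30, clock=fake.time, sleep=fake.sleep)
for url in ["a", "b", "c"]:
    queue.add(url)
done = queue.run(str.upper)
```

With the defaults left in place, the same class would use the real clock and `time.sleep`, spacing jobs two seconds apart at 30 per minute.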

@mattseh Thanks for sharing that, I will have a look.

@Berto Cheers for the PM, I will drop you a reply in a minute.
genetic is offline   Reply With Quote
Old 01-23-2012, 10:15 AM   #18 (permalink)
Automation Specialist
 
Bofu2U's Avatar
 
Join Date: May 2007
Location: Baltimore, Murdaland
Posts: 8,291
iTrader: 78 / 100%
Bofu2U has a reputation beyond repute
Quote:
Originally Posted by genetic View Post
@Bofu2U Sweet. I haven't started working on proxy support yet, but when I'm there I'll hit you up.
Sounds good. You'll need about 1500-2000 for my test if you're game.
Bofu2U is online now   Reply With Quote
Old 01-23-2012, 10:17 AM   #19 (permalink)
 
Join Date: Jun 2006
Location: Boston, MA
Posts: 2,211
iTrader: 10 / 100%
CLKeenan has a reputation beyond repute
Sent a PM as well
__________________
Interested in guest posting on a 9 year old domain in the education niche? PM me with what you have to offer. Looking for other guest posts, written content, design/dev work, and general SEO work.
CLKeenan is offline   Reply With Quote

All times are GMT -4. The time now is 07:54 PM.


WickedFire.com Copyright © 2012 - WickedFire is an international registered Trademark of Coastal Synergy LLC. You may not use any of our trademarks, copyrights, content, or images without a written approval by members of Coastal Synergy LLC.
