
When scraping, just behave so as not to piss off the site owner, whatever that means. E.g., don't cause excessive load, and make sure you don't leak sensitive data.

Next, put yourself in their shoes and realize they usually don't monitor their traffic that much, or simply don't care as long as you don't slow down their site. It's usually only certain big sites with heavy bot traffic, such as LinkedIn or sneaker shops, that implement bot protections. Most others don't care.

Some websites are built almost as if they want to be scraped. The JSON API used by the frontend is ridiculously clean and accessible. Perhaps they benefit when people see their results and invest in their stock. You never fully know whether a site wants to be scraped or not.

The reality of the scraping industry, as it relates to your question, is this:

1. Scraping companies generally don't use a real user agent such as 'my friendly data science bot'; they hide behind a set of fake ones and/or route their traffic through a proxy network. You don't want to get banned so stupidly easily by revealing your user agent when you know your competitors don't reveal theirs.

2. This one is obvious. The general rule is to scrape continuously over a long period of time and add large delays of at least 1 second between requests. Be careful if you go below 1 second.

3. robots.txt is controversial and no longer serves its original purpose. It should be renamed google_instructions.txt, because site owners use it to guide Googlebot through their site. It is generally ignored by the industry, again because you know your competitors ignore it.

Just remember the rule of 'don't piss off the site owner' and then go ahead and scrape. Also keep in mind that you are in a free country, and we don't discriminate here, whether for racial or gender reasons or over whether you are a biological or mechanical website visitor.

I've simply described the reality of the data science industry around scraping after several years of being in it. Note that this will probably not be liked by the HN audience, as they are mostly web devs and site owners.
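The 'large delays' rule from point 2 is easy to get wrong with ad-hoc sleep() calls scattered around a script. A minimal sketch of one way to enforce a minimum gap between consecutive requests; the urllib default is just an assumption, and the fetch function is injectable so any HTTP client works:

```python
import time
import urllib.request

class PoliteFetcher:
    """Enforces a minimum delay between consecutive requests."""

    def __init__(self, min_delay=1.0, fetch=None):
        self.min_delay = min_delay
        self._last = 0.0
        # `fetch` is injectable so the throttling logic is testable offline
        self.fetch = fetch or (lambda url: urllib.request.urlopen(url).read())

    def get(self, url):
        # Sleep only for whatever remains of the minimum gap
        wait = self._last + self.min_delay - time.monotonic()
        if wait > 0:
            time.sleep(wait)
        try:
            return self.fetch(url)
        finally:
            self._last = time.monotonic()
```

Measuring from the end of the previous request (rather than its start) keeps the gap honest even when responses come back slowly.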

You are correct that I don't like this advice... not because I find it to be wrong, but because you are approaching it solely from a competitive perspective -- "Your competitors don't have ethics, so you shouldn't either." That doesn't help someone who is engaging in research and trying to hold themselves to a higher standard.
I'm neither a web dev nor a site owner, but OP literally asked for tips on ethical web scraping, not "what's the most I can get away with".
That's because OP is operating under the wrong assumption that sites won't ban an ethical scraper. The reality is that they will, and much faster than an unethical one. They don't care about your science project, they want ad revenue, conversions...
Point 1 is the only one I don't like. I think you should use your real user agent first on any given site, as a courtesy; whether you give up or switch to a more "normal" user agent if you get banned is up to you.

Oh, and for 3.: if you can, apply some heuristics to your reading of the robots.txt. If it's just "deny everything", then ignore it, but you really don't want to be responsible for crawling all of the GET /delete/:id pages of a badly-designed site… (those should definitely be POST, and authenticated, by the way).

I disagree. The risks are similar to those of disclosing a security vulnerability to a company without a bug bounty. You cannot know how litigious or technically illiterate the company will be. What if they decide you're "hacking" them and call the FBI with the helpful information you included in your user agent? Crazier things have happened.

Anonymity is part of the right to privacy; IMO, such a right should extend to bots as well. There should be no shame in anonymously accessing a website, whether via automated means or otherwise.

> such a right should extend to bots as well

No, it very much shouldn't, but (as you probably meant) it should extend to the person (not, eg, company) using a bot, which amounts to the same thing in this case.

Also, if a target site is behind Cloudflare, then you probably won’t be able to masquerade as any of the popular bots; they block fake Google/Yandex bots.
"or making sure you don't leak out sensitive data"

If sensitive data can be scraped, it is not really stored securely. So I would not worry too much about it, and would just notify the owner if I noticed it.

Keep in mind that if you end up with data that is protected under the GDPR, merely having it puts you in a damning position. The site that held it will be fried for not protecting it adequately, but you also violate the GDPR, since the data subjects never agreed to you collecting, processing, etc. their data. And imagine the world of pain if you are caught with children's data.
Well, having a few websites of my own, I really do think that point 1 is the worst. I can't filter bots that disguise themselves as users out of my access logs, and they actually hurt my work (i.e. figuring out what people read).

Totally agree with the rest, though. I'd maybe adapt the "large delay" of 1 second to the kind of website I'm scraping.

Thanks for your feedback!

> I can't filter bots that disguise as users from my access logs, and they actually hurt my work (i.e. figuring out what people read).

If the bots aren't querying from residential IPs, you could match their IPs to ASNs and then filter based on that to separate residential and data-center origins.

Ha, that's a good idea! Is there a list somewhere of the CIDR blocks that are assigned to residential ISPs vs. server farms? I mean, how can I tell whether an IP is residential?
The other way around may be easier, i.e. excluding known datacenter ranges. There are some commercial databases for that; I'm not sure if there are any free ones. But you can also do this manually by running a whois on an IP, extracting the ranges from the whois response, and caching them. Then you can look at the OrgName or something like that. You can also download the whois databases from the RIRs, but they don't contain information on what kind of entities they are.

    $ dig +short reddit.com
    151.101.1.140


    $ whois 151.101.1.140

    NetRange:       151.101.0.0 - 151.101.255.255
    CIDR:           151.101.0.0/16
    OrgName:        Fastly
    [...]
So if you see a known hoster here, you can exclude it from your statistics.
What I've done in the past is pull down all the IPs of requests I see, filter for unique ones, and do a whois for each of them (you're going to need a backoff/rate limit here, as whois services are usually rate limited), saving the organization name, ASN, and CIDR blocks. Again filter for uniqueness, then create a new list with the organizations of interest and match it against the CIDR blocks. Now you have an allow/blocklist you can use.
It won’t help you learn to write a scraper, but using the common crawl dataset will get you access to a crazy amount of data without paying to acquire it yourself.

https://commoncrawl.org/the-data/

Cool, didn't know about this. Thanks!
> As part of my learning in data science, I need/want to gather data.

Also not web scraping, but a few other public data set sources to check.

https://registry.opendata.aws

https://github.com/awesomedata/awesome-public-datasets

Nice of you to ask this question and to think about how to be as considerate as you can.

Some other thoughts:

- Find the most minimal, least expensive (for both you and them) way to get the data you're looking for. Sometimes you can iterate through search results pages and get all you need from there in bulk, rather than iterating through detail pages one at a time.

- Even if they don't have an official/documented API, they may very well have internal JSON routes or RSS feeds that you can consume directly, which may be easier for them to accommodate.

- Pay attention to response times. If you get your results back in 50ms, it probably was trivially easy for them, and you can request a bunch without troubling them too much. On the other hand, if responses are taking 5s to come back, be gentle. If you are using internal undocumented APIs, you may find that you get faster/cheaper cached results if you stick to the same sets of parameters the site uses on its own (e.g., when the site's front end makes AJAX calls).

That's great advice! Especially the part about response times. I didn't think of that, and will integrate it into my sleep timer :)
I always add an “Accept-Encoding” header to my request to indicate I will accept a gzip response (or deflate if available). Your http library (in whatever language your bot is in) probably supports this with a near trivial amount of additional code, if any. Meanwhile you are saving the target site some bandwidth.

Look into If-Modified-Since and If-None-Match/ETag headers as well if you are querying resources that support them (RSS feeds and static resources, for example, commonly do). They save the target site from having to send anything other than a 304, saving bandwidth and possibly compute.

> Meanwhile you are saving the target site some bandwidth.

And costing them some CPU :) It’s probably a good idea in most cases, agreed, but there are exceptions such as if you are requesting resources in already-compressed formats, like most image/video codecs.

Frankly, it would be difficult to find a part of your post that is correct.

1. You're never causing their server to do anything they didn't configure their server to do. Accept headers are merely information for the server telling them what you can accept: what they return to you is their choice, and they can weigh the tradeoffs themselves.

2. The tradeoff you think is happening isn't even happening in a lot of cases. In a lot of cases they'll be serving that up from a cache of some sort so the CPU work has already been done when someone else requested the page. CPU versus bandwidth isn't an inherent tradeoff.

This is baffling to me, since I’ve always thought of gzip (or other) compression as something the web server applies (or is configured to apply) only to text formats like HTML, JS, CSS, etc. I’m curious to know which badly written servers or sites compress already-compressed content like images and videos just because a user agent says it’ll accept compressed content.
> And costing them some CPU

On the servers that have no purpose in life but to handle caching. I’d much rather browsers and scrapers alike hit my Apache Traffic Server instances with requests needing to return a Not Modified than waste the time of the app servers.

In addition to the steps you're already taking, and the ethical suggestions from other commenters, I suggest that you acquaint yourself thoroughly with intellectual property (IP) law. If you eventually decide to publish anything based on what you learn, copyright and possibly trademark law will come into play.

Knowing what rights you have to use material you're scraping early on could guide you towards seeking out alternative sources in some cases, sparing you trouble down the line.

I'm curious how this would be an issue; factual information isn't copyrightable, and most of the obvious things that I can think to do with a scraper amount to pulling factual information in bulk. Even if it's information like, "this is the average price for this item across 13 different stores". (Although I'm not a lawyer and only pay attention to American law, so take all of this with the appropriate amount of salt)
How much can you quote from a crawled document? Can you republish the entire crawl? What can you do under "fair use" of copyrighted material and what can't you do? Can you articulate a solid defense of your publication that it truly contains only pure factual information? Will BigCo dislike having its name associated with the study but can you protect yourself by limiting your publication to "nominative use" of its trademarks? What is the practical risk of someone raising a stink if the legality of your usage is ambiguous? Who actually holds copyright on the crawled documents?

You have a lot of rights and you can do a lot. Understanding those rights and where they end lets you do more, and with confidence.

So I think I was just being unimaginative about "scraping"; I wouldn't have thought to save quotes/prose, just things like word counts, processed results (sentiment analysis), pricing, etc. In which case most of that shouldn't come up, but yes, I can see where other options are less simple.
> factual information isn't copyrightable

Tell that to Aaron Swartz.

Sure, if you think of factual information as an abstract concept. But as soon as you put that abstract concept into a concrete representation, that representation is absolutely copyrightable. And when you scrape data you're not scraping abstract information, you're scraping the representation of that information.

Try publishing PDFs of college textbooks online and see how well your "I'm just publishing factual information" argument works.

I'm not saying I agree with the law on this, and I'm also not saying that the way the law was intended should apply to the situation of scraping.

> Tell that to Aaron Swartz.

He wasn't downloading (purely) factual information, as I understood it.

> college textbooks

Not even remotely raw factual information. Heck, a table of numbers with a descriptive label probably is copyrightable, but you can scrape the table itself, yes?

I think the issue here is that I assumed a very narrow idea of what people would scrape; it hadn't crossed my mind to download prose or such, which I think is why we're arriving at different conclusions.

Simple,

respect robots.txt

find your data from sitemaps, and ensure you query at a slow rate. robots.txt can specify a cool-off period (Crawl-delay). See https://en.wikipedia.org/wiki/Robots_exclusion_standard#Craw...

example: https://www.google.com/robots.txt
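Python's standard library can read both the allow/deny rules and the Crawl-delay directive mentioned above. A small sketch; the user agent string and URLs are just placeholders:

```python
import urllib.robotparser

def crawl_policy(robots_txt, user_agent, url):
    """Return (allowed, crawl_delay) for a URL, given a robots.txt body."""
    rp = urllib.robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    # crawl_delay() returns None when the directive is absent
    return rp.can_fetch(user_agent, url), rp.crawl_delay(user_agent)
```

In a real crawler you would fetch the site's /robots.txt once, cache the parsed result, and consult it before every request.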

Yeah, that's a must-do, but I think most websites don't even bother making a robots.txt beyond "please index us, Google". However, that wouldn't necessarily mean they're happy about someone vacuuming up their whole website in a few days.
I think your main obligation is not to the entity from which you’re scraping the data, but the people whom the data is about.

For example, the recent case between LinkedIn and hiQ centered on the latter not respecting the former’s terms of service. But even if they had followed those to a T, what hiQ is doing — scraping people’s profiles and snitching to their employer when it looked like they were job hunting — is incredibly unethical.

Invert power structures. Think about how the information you scrape could be misused. Allow people to opt out.

That's a fair point indeed. I don't think I will ever expose non-anonymized data, because that's just too sensitive. But if I ever do, I'll make sure people are made aware they are listed, and that they can opt out easily.
I tried to find a source to back up what you’re saying about hiQ “snitching” to employers about employees searching for jobs, but all I can find is vague documentation about the hiQ v. LinkedIn lawsuit.

Do you have a link to an article or something?

It's their actual product, Keeper.

> Keeper is the first HCM tool to offer predictive attrition insights about an organization's employees based on publicly available data.

Indirectly related, if you have some time to spare follow Harvard's course in ethics! [1]

Here is why: while it didn't teach me anything new (in a sense), it did give me a vocabulary to better articulate myself. Having new words to describe certain ideas means you have more analytical tools at your disposal. So you'll be able to examine your own ethical stance better.

It takes some time, but instead of watching Netflix (if that's a thing you do), watch this instead! Although, The Good Place is a pretty good Netflix show sprinkling some basic ethics in there.

[1] https://www.youtube.com/watch?v=kBdfcR-8hEY

Thanks for sharing that Harvard course.

The cost benefit analysis part reminds me a lot of some of the comments you see here (and elsewhere) with regards to Covid-19 and the economic shutdown of societies. Quite timely.

My general attitude towards web scraping is that if I, as a user, have access to a piece of data through a web browser, the site owners have no grounds to object to me using a different program to access the data, as long as I’m not putting more load on their servers than a user clicking all the links would.

Obviously, there may be legal repercussions for scraping, and you should follow such laws, but those laws seem absurd to me.

Common CMSes are fairly good at caching and can handle a high load, but quite often someone deems a badly programmed extension "mission critical". In that case, one of your requests might trigger dozens of database calls. If multiple sites share a database backend, an accidental DoS might bring down a whole organization.

If the bot has a distinct IP (or distinct user agent), then a good setup can handle this situation automatically. If the crawler switches IPs to circumvent a rate limit or for other reasons, then it often causes trouble in the form of tickets and phone calls to the webmasters. Few care about some gigabytes of traffic, but they do care about overtime.

Some react by blocking whole IP ranges. I have seen sites that blocked every request from the network of Deutsche Telekom (Tier 1 / former state monopoly in Germany) for weeks. So you might affect many on your network.

So:

* Most of the time it does not matter whether you scrape all the information you need in minutes or overnight. For crawl jobs I try to avoid the times of day when I expect high traffic to the site. So I would not crawl restaurant sites at lunch time, but 2 a.m. local time should be fine. If the response time suddenly goes up at that hour, it may be due to a backup job. Simply wait a bit.

* The software you choose has an impact: if you use Selenium or headless Chrome, you load images and scripts. If you do not need those, analyzing the source (with, for example, Beautiful Soup) uses fewer of the server's resources and might be much faster.

* Keep track of your requests. A specific file might be linked from a dozen pages of the site you crawl. Download it just once. This can be tricky if a site uses A/B testing for headlines and changes the URL.

* If you provide contact information read your emails. This sounds silly, but at my previous work we had problems with a friendly crawler with known owners. It tried to crawl our sites once a quarter and was blocked each time, because they did not react to our friendly requests to change their crawling rate.

Side note: I happen to work on a python library for a polite crawler. It is about a week away from stable (one important bug fix and a database schema change for a new feature). In case it is helpful: https://github.com/RuedigerVoigt/exoskeleton

If you use Selenium and the Chrome WebDriver, you can disable image loading with: AddUserProfilePreference("profile.default_content_setting_values.images", 2)
As a sort of poor man's rate limiting, I have written spiders that sleep after every request for the length of the previous request (sometimes the length of the request times a sleep factor that defaults to 1). My thinking is that if the site is under load, it will respond more slowly, and my spider will slow down as well.
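That scheme fits in a few lines. A sketch, with the fetch function passed in so it works with any HTTP client:

```python
import time

def fetch_and_backoff(fetch, url, sleep_factor=1.0):
    """Fetch a URL, then sleep for the request's duration times `sleep_factor`."""
    start = time.monotonic()
    result = fetch(url)
    elapsed = time.monotonic() - start
    time.sleep(elapsed * sleep_factor)  # a slow site makes for a slow spider
    return result
```

A nice property is that it needs no tuning per site: the server's own response time sets the pace.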
This might be overboard for most projects, but here is what I recently did. There is a website I use heavily that provides sales data for a specific type of product. I actually e-mailed them to make sure this was allowed, because they took down their public API a few years ago. They said yes, everything that is on the website is fair game, and you can even do it on your main account. It was actually a surprisingly nice response.
I work with a scientific institution, and it's still amazing to me that people don't check or ask whether there are downloadable full datasets that anyone can have for free. They just jump right into scraping websites.

I don't know what kind of data you're looking for, but please verify that there isn't a quicker/easier way of getting the data than scraping first.

I've gone through this process twice: once about six months ago, and once just this week.

In the first case the content wasn't clearly licensed and the site was somewhat small, so I didn't want to break it. I emailed them and they gave us permission, but only if we crawled no more than one page every ten seconds. It took us a weekend, but we got all the data, and did so in a way that respected their site.

The second one was this last week and was part of a personal project. All of the content was under an open license (Creative Commons), and the site was hosted on a platform that can take a ton of traffic. For this one I made sure we weren't hitting it too hard (Scrapy has some great autothrottle options), but otherwise didn't worry about it too much.

Since the second project is personal I open sourced the crawler if you're curious- https://github.com/tedivm/scp_crawler

My policy on scraping is to never use asynchronous methods. I've seen a lot of small e-commerce sites that can't really handle the load; even a few hundred requests per second can crash the server. So even if it takes me longer to scrape a site, I prefer not to cause any real harm as long as I can avoid it.
The suggestions in the comments are excellent. One thing I would add is this: contact the site owner in advance and ask for their permission. If they are okay with it or if you don't hear back, credit the site in your work. Then send the owner a message with where they can see the information being used.

Some sites will have rules or guidelines for attribution already in place. For example, the DMOZ had a Required Attribution page to explain how to credit them: https://dmoz-odp.org/docs/en/license.html. Discogs mentions that use of their data also falls under CC0: https://data.discogs.com/. Other sites may have these details in their Terms of Service, About page, or similar.

The rules you named are ones I personally followed. One other extremely important thing is privacy when you want to crawl personal data, as on social networks. I personally avoid crawling data that inexperienced users might accidentally expose, like email addresses, phone numbers, or their friends list. A good rule of thumb for me on social networks has always been to scrape only the data that is visible when my bot is not logged in (which also helps avoid breaking the provider's ToS).

The most elegant way would be to ask the site provider if they allow scraping their website and which rules you should obey. I was surprised how open some providers were, but some don't even bother replying. If they don't reply, apply the rules you set and follow the obvious ones like not overloading their service etc.

I tried the elegant way before, after creating a mobile application to find fuel pumps around the country for a specific brand. My request was greeted with a "don't publish; we're busy making one; we'll sue you anyway". I guess where I'm from, people don't share their data yet...

Totally agree with the point on accidental personal data, thanks for pointing that out!

PS: they never released their app...

It's helpful to filter out links to large content and downloadable assets from being traversed. For example, I assume you wouldn't care about downloading videos, images, and other assets that would otherwise use a large amount of data transfer and increase costs.

If the file type isn't clear, the response headers would still include the Content-Length for non-chunked downloads, and the Content-Disposition header may contain the file name with extension for assets meant to be downloaded rather than displayed on a page. Response headers can be parsed prior to downloading the entire body.
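A sketch of that header check as a plain function over a headers mapping; the size cutoff is an arbitrary example, and with a real client you would read these headers from a HEAD request or a streamed response before consuming the body:

```python
def should_download(headers, max_bytes=1_000_000):
    """Decide from response headers alone whether the body is worth fetching."""
    length = headers.get("Content-Length")
    if length is not None and int(length) > max_bytes:
        return False  # large asset, likely video/archive
    if "attachment" in headers.get("Content-Disposition", "").lower():
        return False  # meant to be saved as a file, not rendered as a page
    # Only page-like content types are interesting to this scraper
    return headers.get("Content-Type", "").startswith(("text/html",
                                                       "application/json"))
```

Skipping the body when this returns False saves both sides the bulk of the transfer cost.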

If all scrapers did what you did, I'd curse a lot less at $work. Kudos for that.

Re 2 and 3: do you parse/respect the "Crawl-delay" robots.txt directive, and do you ensure that works properly across your fleet of crawlers?

Hehe, my "fleet of crawlers" is a single machine in a closet so far :) I'll think about that kind of synchronization later.

However I do parse and respect the "crawl-delay" now, thanks for pointing it out!

A large fraction of websites with Crawl-Delay set it a decade ago and promptly forgot about it. No modern crawler uses it for anything other than a hint. The primary factors for crawl rate are usually site page count and response time.
Be careful about making the data you've scraped visible to Google's search engine scrapers.

That's often how site owners get riled up. They search for some unique phrase on Google, and your site shows up in the search results.

This isn't really an "ethical" practice, more like a "how to hide that you are scraping" practice. If you have to hide the fact that you are scraping their data, maybe you shouldn't be doing it in the first place.
Depends. Maybe, for example, you're doing some competitive price analysis and never plan on exposing scraped things like product descriptions...you only plan to use those internally to confirm you're comparing like products. But you expose it accidentally. Avoid that.
In some cases, especially during development, local caching of responses can help reduce load. You can write a little wrapper that tries to return url contents from a local cache and then falls back to a live request.
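A minimal version of such a wrapper, keyed on a hash of the URL; the `cache/` directory name and the urllib default fetcher are just assumptions:

```python
import hashlib
import pathlib
import urllib.request

CACHE_DIR = pathlib.Path("cache")  # hypothetical local cache directory

def cached_get(url, fetch=None):
    """Return page contents from disk if seen before, else fetch and store."""
    fetch = fetch or (lambda u: urllib.request.urlopen(u).read())
    CACHE_DIR.mkdir(exist_ok=True)
    path = CACHE_DIR / hashlib.sha256(url.encode()).hexdigest()
    if path.exists():
        return path.read_bytes()  # cache hit: no request is made
    body = fetch(url)
    path.write_bytes(body)
    return body
```

During development, where you re-run the same parsing code dozens of times, this means the target site is hit once per page instead of once per run.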
As many pages are at least halfway SPAs, make sure to really understand the website's communication with its backend. Identify API calls and try to make them directly instead of downloading full pages and extracting the required information from the HTML afterwards. If certain data sets from specific API calls almost never change, crawl them less frequently and cache the results instead.
You may need to get more specific about your definition of "ethical".

For example, do you just mean "legal"? Or perhaps, consistent with current industry norms (which probably includes things you'd consider sleazy)? Or not doing anything that would cause offense to site owners (regardless of how unreasonable they may seem)?

I do think it's laudable that you want to do good. Just pointing out that it's not a simple thing.

Haven’t seen anyone mention this, but asking permission first is about the most ethical approach. If you think sites are unlikely to give you permission, that might be an indication that what you’re doing has limited value. Offering to share your results with them could be a good plan.

I work for a company that does a lot of web scraping, but we have a business contract with every company we scrape from.

Schema.org is a nice resource. If you can find that metadata on a site, you can be a little more confident they don’t mind getting that data scraped. It’s the instruction book for teaching Google and other crawlers extra information and context. Your scraper would be wise to parse this extra meta information.
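Schema.org data usually arrives as JSON-LD inside a script tag of type "application/ld+json", which the standard library can pick out without a full HTML toolkit. A sketch:

```python
import json
from html.parser import HTMLParser

class JSONLDExtractor(HTMLParser):
    """Collect schema.org JSON-LD blocks embedded in a page."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and ("type", "application/ld+json") in attrs:
            self._in_jsonld = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_jsonld = False

    def handle_data(self, data):
        if self._in_jsonld and data.strip():
            self.items.append(json.loads(data))

def extract_jsonld(html):
    parser = JSONLDExtractor()
    parser.feed(html)
    return parser.items
```

Each extracted item is already a structured dict (with @type, name, price, and so on), so there's no fragile HTML selector work for the fields the site has chosen to expose this way.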
The only sound advice one can give is that there are two elements to consider:

1) Ethics is different from law.

1.1) The ethical way: respect the robots.txt protocol.

2) Consult a lawyer.

2.1) Prior written consent, they will say, prevents you from being sued, and not much else.
IMO, the best practice is “don’t”. If you think the data you’re trying to scrape is freely available, contact the site owner, and ask them whether dumps are available.

Certainly, if your goal is “learning in data science”, and thus not tied to a specific subject, there are enough open datasets to work with, for example from https://data.europa.eu/euodp/en/home or https://www.data.gov/

Where does this 'best practice is "don't"' idea come from? I've seen it a couple of times when the scraping topic surfaces. I think it is kind of hypocritical, and actually acting against one's own good and even the good of the internet as a whole, because it artificially limits who can do what.

Why are there entities that are allowed to scrape the web however they want (and who got into their position by scraping the web), while the regular Joe is discouraged from doing so?

In my book, “not best practice” doesn’t imply “never do”, but web scraping should be your option of last resort. Doing it well takes ages, and time spent doing it will often distract you from your goal.

As I said, in this case “learning data science” likely doesn’t require web scraping; it just requires some suitable data set.

The OP claimed in another comment that that doesn’t exist, but (s)he doesn’t say what data (s)he’s looking for, so that’s impossible to check.

I'm a lot more motivated to do data science on topics I actually care about :) Unfortunately those topics (or websites, in this case) don't expose ready-made databases or csv files.
I like this approach. Personally, I wait an hour if I get an invalid response, and use delays of a few seconds between other requests.
Ethical web scraping? Is that even a thing?
No, it's not, and discussing it as if it were a thing is irrational. Ethics are based on morals, and morals are based on determining a "right" course of action for a given act.

Just because something is legal, by absence of law, doesn't mean it's right or fair for all cases. Just because something is illegal (copyright) doesn't mean it's not right or fair for all cases. What if the information saved a million lives? Would it still be ethical to claim "ownership" of that information?

What if the information caused a target audience to visualize that thing over and over again? Is it right to allow that information out into the public at all?

'disable javascript in your browser'