TSM - Ethical and Legal Considerations in Web Scraping at Scale

Vlad Precup - Engineering Manager @ ComplyAdvantage

Data is the most valuable commodity nowadays, and it lies at the core of almost every business, irrespective of its domain. Whether we are talking about search engines, entertainment platforms, e-commerce stores or fintech solutions, they all revolve around data and strive to present it to their users in the most meaningful way. In the process of gathering these vast amounts of data, many organizations leverage web scraping to extract information from third-party sources, such as websites. In most cases, the information on these websites is designed for human end users. Automated information extraction agents are a best-effort approach to turning the human-readable result into specific data structures that can be further processed by internal systems before being used for research or commercialized in various forms, either directly or indirectly. This article aims to identify some of the most important ethical and potential legal implications of leveraging this technique for the above-mentioned purposes, as well as to explore best practices employed in the volume data collection space. Under no circumstances does it constitute legal advice. To ensure that your project meets legal and regulatory requirements, it is recommended that you seek professional legal advice.

Legal Implications

Despite the accelerated adoption of web scraping to power data services and applications, there are no globally applicable legal terms governing the use of this practice.

However, a minimal set of generally applicable guidelines has emerged over time. For example, in a dedicated article from 2011, a business representative of Infochimps, a data-selling company acquired by CSC in 2013, mentioned that there are three key legal aspects that should be taken into consideration when scraping data from third parties:

  1. Copyright

  2. Terms of service

  3. Trespass to chattels.

Copyright

A powerful quote from the article mentioned above is: "Facts and ideas are not copyrightable. However, expressions or arrangements of facts may be copyrightable." This means that the first thing companies need to ensure is that they are collecting only facts from the data sources and nothing else.

Terms of Service

In a nutshell, terms of service (ToS) represent a legally binding agreement between the service provider (i.e. the website) and the consumer (i.e. the user / scraper) regarding the usage of the provided services (in our case, the information published by that website). First of all, this agreement is specific to each service; there are no general ToS for websites, for example. Furthermore, in many cases terms of service lie in a grey area of enforceability, either because they can change without notice or because they lack explicit user consent (i.e. the stipulation that continued use of the service automatically implies consent to the ToS). This is why data extraction agents should treat each source separately, based on its ToS.

Trespass to Chattels

While copyright and terms of service are commonly used and widely known terms in the world of (software) engineering, "trespass to chattels" is a less frequently used term, borrowed from the seven intentional torts of common law. It refers to the intentional interference with another person's lawful possession of personal property. For example, when it comes to web scraping, a DoS (Denial of Service) caused by a crawler / scraper which puts a high load on the website, or unauthorised access to private information, are two instances that can be considered "trespass to chattels".

While these three aspects establish a regulatory baseline for much of the existing content on the internet when considering web scraping, one should bear in mind that this is merely a US-based guideline, which may not hold in other jurisdictions. Furthermore, depending on the jurisdiction, as well as on the court judging the case, rulings have historically gone both against and in favor of the authors of these practices. For example, while a Texas trial court ruled in favor of the plaintiff, American Airlines, in its case against FareChase, a fare comparison website which scraped prices from AA's website, the Danish Maritime and Commercial Court in Copenhagen ruled that the defendant ofir.dk's extraction of data from the plaintiff home.dk's website did not conflict with Danish law or the EU Database Directive.

A few regulatory additions and variations built on this baseline, such as the GDPR or the CNIL guidelines, will be introduced in the "Case Studies" section.

Ethical Considerations

Besides the legal aspects covered above, there are ethical issues that may arise when crawling and/or scraping external websites. Although presented here from an academic research standpoint, the following also apply to products with a commercial end goal. The possible harmful consequences of web scraping can be of the following nature:

  1. Individual Privacy

  2. Organizational Privacy and Trade Secrets

  3. Diminishing Value for the Organization.

Since there is a general challenge in establishing a clear boundary between law and ethics, with law often being derived from or supporting ethical principles, the author suggests addressing a range of questions in order to make a web scraping project both legal and ethical.

Case Studies

Government websites - their mission is to provide transparency to the people accessing them, hence there should not be any legal consequences to extracting information from these portals. This statement relies on the fact that government information management follows strict classification procedures: information which is not classified is public information. If, however, hacking were performed to access information which would otherwise not be accessible to a regular user, this would be unethical and would carry serious legal consequences.

Search engines (Google, Bing etc.) - they obey the Robots Exclusion Protocol, a method (not yet fully standardised) by which webmasters express their preference for limiting or preventing these engines from crawling (and scraping) their pages. Explaining how this protocol works is outside the scope of the present article, but more information about the robots.txt file can be found on the official robotstxt.org page. Also, robots meta tags and X-Robots-Tag headers are explained in great detail in Google's developer reference.
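
To illustrate, here is a minimal Python sketch, using the standard library's urllib.robotparser, of how a scraper might check a site's robots.txt before fetching a page; the domain and the user agent name are illustrative assumptions, not part of any particular product.

    from urllib import robotparser

    # Illustrative crawler name and target site.
    USER_AGENT = "AcmeDataBot"
    PAGE = "https://www.example.com/public/listing.html"

    rp = robotparser.RobotFileParser()
    rp.set_url("https://www.example.com/robots.txt")
    rp.read()  # fetch and parse the site's robots.txt

    if rp.can_fetch(USER_AGENT, PAGE):
        # Honour an explicit crawl delay if the site declares one.
        delay = rp.crawl_delay(USER_AGENT) or 1
        print(f"Allowed to fetch {PAGE}; waiting {delay}s between requests")
    else:
        print(f"robots.txt disallows fetching {PAGE}; skipping")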

Scraping public LinkedIn data is [at least at the time of writing] legal - according to the US Court of Appeals, which denied LinkedIn's request to prevent hiQ from scraping its data in 2019. More details regarding LinkedIn's petition to the Supreme Court in 2020, as well as hiQ's opposition, can be found in this article.

From a data scientist's perspective (see the article by Tom Waterman, Data Scientist @ Facebook), here are a couple of key points:

  1. Copyrighted data may not be replicated on other websites by those who scraped it from the source.

  2. Scraping data behind authentication is illegal, as such data is directly subject to the Terms of Service. In contrast, publicly available sites "cannot require a user to agree to any Terms of Service before accessing the data".

Scraping under the GDPR now requires much more attention than before. In this blog post, the Head of Legal at ScrapingHub, a leading scraping service company, presents their point of view in this regard. In a nutshell, companies need to assess whether they are scraping personal data and whether that data belongs to EU residents. If this is the case, they need to justify the processing with a lawful basis, and even then, scraping should be performed under very strict conditions.
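
As a purely illustrative sketch (not a compliance mechanism), a pipeline could flag scraped records that appear to contain personal data before persisting them, so that a lawful basis can be confirmed or the offending fields dropped; the field names and the e-mail pattern below are assumptions made for the example.

    import re

    # Illustration only: flag records that appear to contain personal data
    # (e-mail addresses here) so a lawful basis can be confirmed before storage.
    EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

    def contains_personal_data(record: dict) -> bool:
        return any(
            isinstance(value, str) and EMAIL_PATTERN.search(value)
            for value in record.values()
        )

    record = {"title": "Press release", "contact": "jane.doe@example.com"}
    if contains_personal_data(record):
        print("Record flagged for GDPR review before storage")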

Scraping from sources based in France has also become more complicated with the April 2020 CNIL guidelines. As this article describes, France is among the first countries to bring web scraping under explicit regulation. The new guidelines are in accordance with the EU's GDPR and, according to the article, web scraping companies must provide a specific set of information to anyone whose data they collect.

Web Scraping Guidelines

With all these principles to keep in mind, and with the specificities of various jurisdictions and data sources, it is not straightforward for data engineering teams to start such a project confidently, abiding by the regulations while also reaching their goals.

FindDataLab is a web scraping [consultancy] company founded by a team of data scientists. Based on their 10+ years of experience, they have compiled a list of high-level tips, with further explanations, that aim to ensure web scraping is both legal and ethical. ScrapingHub also presents this set of guidelines in a different form. Here's a quick glimpse at the list, followed by a short sketch illustrating some of its points:

  1. Make sure that the purpose of web scraping is legal

  2. Make sure that you want to get publicly available information

  3. Check copyrights

  4. Set the optimal web scraping rate

  5. Direct your web scrapers along a path similar to that of a search engine (respect the robots.txt file)

  6. Identify your web scrapers
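
To make points 4 and 6 more concrete, below is a minimal Python sketch of a polite scraper that throttles its requests and identifies itself via the User-Agent header. The bot name, contact address and URLs are placeholders, and the actual request rate should be tuned per source and its ToS.

    import time
    import requests

    # Placeholder bot identity and target URLs; the interval should be tuned
    # per source so the scraper never puts noticeable load on the website.
    HEADERS = {
        "User-Agent": "AcmeDataBot/1.0 (+https://acme.example.com/bot; data-team@acme.example.com)"
    }
    REQUEST_INTERVAL_SECONDS = 5

    urls = [
        "https://www.example.com/articles/1",
        "https://www.example.com/articles/2",
    ]

    for url in urls:
        response = requests.get(url, headers=HEADERS, timeout=30)
        response.raise_for_status()
        # ... hand response.text over to the extraction / parsing stage ...
        time.sleep(REQUEST_INTERVAL_SECONDS)  # keep the request rate low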

Furthermore, in this article from towardsdatascience.com, J. Densmore mentions a few additional key points, which mostly cover the ethical side.

Finally, ScraperHero, another data company, goes more in-depth with a handy set of technical guidelines on how to avoid being blocked by websites while web scraping.

Scraping Data at Scale

Now that the principles and industry practices have been introduced, it is worth noting that one of the key differentiators for companies in the data landscape is the volume of their data. In order to collect more data, this tedious process needs to scale.

When it comes to data acquisition projects in large data companies, developing automatic data scraping agents can easily become a bottleneck, both in terms of data pipeline architecture and of the engineers required to develop, test and maintain each single web scraper. It has been found that, due to changes in website formats, web scrapers need to be completely rewritten every three years on average. Given the effort required to manage the lifecycle of these agents, the number of engineers working on them would have to grow linearly, which in most cases becomes infeasible for companies.

This blog post authored by ScrapingHub covers the aspects that need to be taken into account when thinking about scaling out such a system. The first problem that should be addressed is the ever-changing nature of website formats. This can potentially be addressed by generalizing data extraction instead of having granular scrapers per data source, enabling an accelerated data ingestion pipeline which, in turn, requires a scalable pipeline architecture. But this is not all. Data collection should also be optimized by ensuring that only the required information is extracted (for example, by avoiding loading the pipeline with bloat such as images, auxiliary page resources, ads or comments appearing underneath articles). Last but not least, in order to speed up the quality assurance process (i.e. testing and maintaining data extraction agents), it should shift from being a reactive process to a proactive one, by automating data validation as well as anomaly and inconsistency detection. For example, instead of having quality assurance engineers perform repetitive, error-prone manual checks on the data extracted from each new source, they should be empowered to monitor data inconsistencies identified automatically and to have an integrated mechanism to fix them with minimal effort.
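
As an illustration of such proactive, automated validation, the minimal Python sketch below checks each extracted record for required fields and flags price outliers against recently observed values; the field names and thresholds are assumptions for the sake of the example, not ScrapingHub's actual approach.

    from statistics import mean, pstdev

    # Proactive quality assurance sketch: validate extracted records against
    # required fields and flag price outliers automatically instead of relying
    # on manual spot checks. Field names and thresholds are illustrative.
    REQUIRED_FIELDS = {"url", "title", "price"}

    def validate(record: dict, recent_prices: list) -> list:
        issues = [f"missing field: {f}" for f in REQUIRED_FIELDS - record.keys()]
        price = record.get("price")
        if price is not None and len(recent_prices) >= 10:
            mu, sigma = mean(recent_prices), pstdev(recent_prices)
            if sigma and abs(price - mu) > 3 * sigma:  # crude anomaly heuristic
                issues.append(f"price {price} deviates strongly from recent values")
        return issues

    record = {"url": "https://example.com/p/1", "title": "Widget", "price": 999.0}
    print(validate(record, recent_prices=[19.9, 21.5, 20.0, 22.3, 18.7,
                                          20.4, 19.5, 21.0, 20.8, 19.2]))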

Conclusion

While there are no clear legal boundaries regarding web scraping, and while there are numerous variations and interpretations of the law in different jurisdictions, there is enough information to conclude that there are commonly recommended guidelines and practices in the industry that should be considered when tackling data collection endeavors within a legal and ethical framework. Once these guidelines are acknowledged and embedded into data engineering practices, scaling the collection effort does not necessarily require scaling the data engineering organization, but rather shifting the perspective to a more automated and intelligent manner of managing the complexities that come with such endeavors.

Bibliography

[1] A. Watters, What It Takes To Find, Scrape And Sell Big Data

[2] Archive, Preliminary injunction in American Airlines v. Farechase Inc., Electronic Frontier Foundation, 2003

[3] Archive, Udskrift af sø- & handelsrettens dombog (home a/s mod OFIR a-s), Wayback Machine (original source: bvhd.dk), 2006

[4] V. Krotov, L. Silva, Legality and Ethics of Web Scraping, Twenty-fourth Americas Conference on Information Systems, 2018

[5] About /robots.txt, The Web Robots Pages / robotstxt.org

[6] Robots meta tag, data-nosnippet, and X-Robots-Tag specifications, Google Search Developers Reference

[7] hiQ Files Opposition Brief with Supreme Court in LinkedIn CFAA Data Scraping Dispute, The National Law Review, 2020

[8] T. Waterman, Web scraping is now legal, Medium, 2020

[9] I. Kerins, GDPR Compliance for Web Scrapers: The Step-by-Step Guide, The ScrapingHub Blog, 2018

[10] FindDataLab.com, Can You Still Perform Web Scraping With The New CNIL Guidelines?, Medium, 2020

[11] Legal Web Scraping for Legal Purposes, FindDataLab

[12] The Web Scraping Best Practices Guide, ScrapingHub

[13] J. Densmore, Ethics in Web Scraping, towards data science, 2017

[14] How to scrape websites without getting blocked, ScraperHero

[15] I. Kerins, Data for Price Intelligence: Lessons Learned Scraping 100 Billion Products Pages, The ScrapingHub Blog, 2018