Lessons Learned from Clearview AI’s Web Scraping Actions

Clearview AI has been in the news for several reasons in the past couple of months. The organization’s use of web scraping for facial recognition outraged privacy advocates and has led to legal difficulties for the company. A data breach within two months of the company making headlines suggests that Clearview AI may have difficulty protecting the data in its care, and it underscores the importance of properly securing repositories of sensitive user data.

Web Scraping for Profit

Many organizations and individuals post a great deal of potentially sensitive information on the Internet. In many cases, this information is placed there for good reason. For example, an organization may need to put its contact information on a webpage to enable customer support. Additionally, a business typically publishes a great deal of information about what it does in order to attract customers looking for its particular products and services.

However, this data can also be used for malicious purposes. Cybercriminals can use it to build extremely tailored (and therefore likely more successful) spear phishing attacks against the organization’s employees. Competitors may use publicly posted data to glean useful insights about an organization’s research and development efforts or to optimize a marketing campaign to lure away an organization’s customers.

Web Scraping in the Courts

Web scraping often falls into a legal gray area since the information is publicly available but is not intended for the uses that a web scraper puts it to. Whether an organization is permitted to perform web scraping largely comes down to the purpose and how the data is put to use.

An example of a largely accepted form of web scraping is travel websites that are designed to help travelers find the best deal on flights, lodgings, etc. by aggregating offers from multiple vendors. In many cases, this aggregation is performed by scraping the vendors’ websites to collect pricing and availability information. While the vendors generally do not like these sites (preferring that customers browse their sites and purchase directly from them), several of these aggregation sites are still operating, and courts have ruled in favor of web scraping in a number of cases.
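The aggregation step described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not how any real travel site works: the vendor names, the sample HTML, the `class="price"` markup, and the `PriceParser` and `cheapest_offer` helpers are all hypothetical. The pages are hard-coded strings, so no real website is contacted.

```python
from html.parser import HTMLParser


class PriceParser(HTMLParser):
    """Collects the text of every element whose class attribute is "price".

    The class name "price" is an assumption about the vendor's markup.
    """

    def __init__(self):
        super().__init__()
        self._in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the opening tag.
        if dict(attrs).get("class") == "price":
            self._in_price = True

    def handle_data(self, data):
        text = data.strip()
        if self._in_price and text:
            # Drop a leading dollar sign and store the price as a number.
            self.prices.append(float(text.lstrip("$")))
            self._in_price = False


def cheapest_offer(pages):
    """Given a dict of vendor name -> HTML, return (vendor, lowest price)."""
    offers = []
    for vendor, html in pages.items():
        parser = PriceParser()
        parser.feed(html)
        offers.extend((price, vendor) for price in parser.prices)
    price, vendor = min(offers)
    return vendor, price


# Hypothetical, hard-coded vendor pages standing in for scraped HTML.
SAMPLE_PAGES = {
    "vendor-a.example": '<div><span class="price">$199.00</span></div>',
    "vendor-b.example": (
        '<div><span class="price">$184.50</span>'
        '<span class="price">$210.00</span></div>'
    ),
}

if __name__ == "__main__":
    print(cheapest_offer(SAMPLE_PAGES))
```

A real aggregator would also have to fetch the pages, cope with each vendor’s different markup, and respect the vendor’s Terms of Service, which is exactly where the legal questions discussed in this section arise.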

On the other side of the web scraping debate, you have organizations like Clearview AI. Clearview AI specializes in the use of web scraping to improve facial recognition algorithms. The organization collected publicly accessible photos from social media sites like Facebook, LinkedIn, and Twitter to build a database of over 3 billion photos of people. This database is used by clients (such as law enforcement) to train or operate facial recognition algorithms.

Clearview AI has been taken to court for its actions and has received, and ignored, cease and desist letters from several social media companies stating that the organization has violated their Terms of Service by using web scraping. Clearview AI claims that its right to scrape publicly available information (i.e. pictures) is protected under the First Amendment to the US Constitution.

However, the First Amendment only protects against government interference with free speech. It does not prevent private businesses from enforcing their Terms of Service. Additionally, while web scraping itself may be deemed legal (based upon court precedents), Clearview AI’s use of it to collect images for facial recognition can violate other laws focused on user privacy and biometric data.

Challenges in Data Security

Even if Clearview AI is permitted to continue their current practices under the law, the organization should be able to demonstrate that they are capable of properly protecting the data that they are collecting. A recent hack of Clearview AI calls this ability into question.

In February 2020, the company revealed a data breach in which the attacker was able to access the company’s entire client list, including data on the number of searches that each of the 600+ government agencies using the application has run against the organization’s database. While the breach did not extend to the database of 3 billion user images that the organization has collected, it demonstrates that Clearview AI’s data security may not be sufficient to stand up to the additional notoriety that it has received due to recent news coverage and court cases. Such a massive collection of valuable data is a huge target for cybercriminals, and Clearview AI is likely to be a target of many attacks in the near future.

Protecting Sensitive Data

Organizations like Clearview AI take advantage of the availability of potentially sensitive data on the public Internet. By aggregating this data into one place, they can derive useful intelligence from it or make a profit by selling access to the data to interested third parties.

However, when collecting and storing this type of data, these organizations can run into challenges. In some cases, the data collection itself (especially of sensitive data like photographs) can face legal challenges. In addition, an organization with such a large collection of sensitive data becomes a target for cybercriminals trying to take advantage of it.

Within two months of a New York Times article publicizing the company and its potential privacy impacts, Clearview AI was the victim of a significant data breach. This should serve as a warning to the company and to any organization with a massive collection of sensitive customer data of the potential risks and the importance of implementing appropriate security controls to protect this data.
