Saturday, August 31, 2024

AI Has Created a Battle Over Web Crawling




Most people assume that generative AI will keep getting better and better; after all, that’s been the trend so far. And it may do so. But what some people don’t realize is that generative AI models are only as good as the ginormous data sets they’re trained on, and those data sets aren’t constructed from proprietary data owned by leading AI companies like OpenAI and Anthropic. Instead, they’re made up of public data that was created by all of us—anyone who’s ever written a blog post, posted a video, commented on a Reddit thread, or done basically anything else online.

A new report from the Data Provenance Initiative, a volunteer collective of AI researchers, shines a light on what’s happening with all that data. The report, “Consent in Crisis: The Rapid Decline of the AI Data Commons,” notes that a significant number of organizations that feel threatened by generative AI are taking measures to wall off their data. IEEE Spectrum spoke with Shayne Longpre, a lead researcher with the Data Provenance Initiative, about the report and its implications for AI companies.

Shayne Longpre on:

  • How websites keep out web crawlers, and why
  • Disappearing data and what it means for AI companies
  • Synthetic data, peak data, and what happens next

    The technology that websites use to keep out web crawlers isn’t new; the robots exclusion protocol was introduced in 1994. Can you explain what it is and why it suddenly became so relevant in the age of generative AI?


    Shayne Longpre: Robots.txt is a machine-readable file that crawlers—bots that navigate the web and record what they see—use to determine whether or not to crawl certain parts of a website. It became the de facto standard in the age where websites used it primarily for directing web search. So think of Bing or Google Search; they wanted to record this information so they could improve the experience of navigating users around the web. This was a very symbiotic relationship because web search operates by sending traffic to websites and websites want that. Generally speaking, most websites played well with most crawlers.

    Let me next talk about a chain of claims that’s important to understand this. General-purpose AI models and their very impressive capabilities rely on the scale of data and compute that have been used to train them. Scale and data really matter, and there are very few sources that provide public scale like the web does. So many of the foundation models were trained on [data sets composed of] crawls of the web. Under these popular and important data sets are essentially just websites and the crawling infrastructure used to collect and package and process that data. Our study looks at not just the data sets, but the preference signals from the underlying websites. It’s the supply chain of the data itself.

    But in the last year, a lot of websites have started using robots.txt to restrict bots, especially websites that are monetized with advertising and paywalls—so think news and artists. They’re particularly fearful, and maybe rightly so, that generative AI might impinge on their livelihoods. So they’re taking measures to protect their data.

    When a site puts up robots.txt restrictions, it’s like putting up a no trespassing sign, right? It’s not enforceable. You have to trust that the crawlers will respect it.

    Longpre: The tragedy of this is that robots.txt is machine-readable but does not appear to be legally enforceable. Whereas the terms of service may be legally enforceable but are not machine-readable. In the terms of service, they can articulate in natural language what the preferences are for the use of the data. So they can say things like, “You can use this data, but not commercially.” But in a robots.txt, you have to individually specify crawlers and then say which parts of the website you allow or disallow for them. This puts an undue burden on websites to figure out, among thousands of different crawlers, which ones correspond to uses they would like and which ones they wouldn’t like.
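    To illustrate the per-crawler structure Longpre describes, here is a minimal Python sketch that uses the standard library's urllib.robotparser to evaluate a small robots.txt. The file contents, the crawler names (ExampleAIBot, ExampleSearchBot, SomeOtherBot), and the example.com URLs are hypothetical and are not taken from the report.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: crawler names and paths are illustrative only.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: ExampleSearchBot
Allow: /

User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that is already in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
URL = "https://example.com/articles/story.html"

# Each crawler has to be named individually; everything else falls
# through to the "*" group.
decisions = {
    agent: parser.can_fetch(agent, URL)
    for agent in ("ExampleAIBot", "ExampleSearchBot", "SomeOtherBot")
}

# The "*" group blocks only the /private/ subtree for unnamed crawlers.
private_ok = parser.can_fetch("SomeOtherBot", "https://example.com/private/notes.html")

for agent, allowed in decisions.items():
    print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

    Note that the file can only say which crawler may fetch which path. It has no way to express a rule such as "non-commercial use only," which is the gap between robots.txt and terms of service that Longpre points to.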

    Do we know if crawlers generally do respect the restrictions in robots.txt?

    Longpre: Many of the major companies have documentation that explicitly says what their rules or procedures are. In the case, for example, of Anthropic, they do say that they respect the robots.txt for ClaudeBot. However, many of these companies have also been in the news lately because they’ve been accused of not respecting robots.txt and crawling websites anyway. It isn’t clear from the outside why there’s a discrepancy between what AI companies say they do and what they’re being accused of doing. But a lot of the pro-social groups that use crawling—smaller startups, academics, nonprofits, journalists—they tend to respect robots.txt. They’re not the intended target of these restrictions, but they get blocked by them.

    In the report, you looked at three training data sets that are often used to train generative AI systems, which were all created from web crawls in years past. You found that from 2023 to 2024, there was a very significant rise in the number of crawled domains that had since been restricted. Can you talk about those findings?

    Longpre: What we found is that if you look at a particular data set, let’s take C4, which is very popular, created in 2019—in less than a year, about 5 percent of its data has been revoked if you respect or adhere to the preferences of the underlying websites. Now 5 percent doesn’t sound like a ton, but it is when you realize that this portion of the data mainly corresponds to the highest quality, most well-maintained, and freshest data. When we looked at the top 2,000 websites in this C4 data set—these are the top 2,000 by size, and they’re mostly news, large academic sites, social media, and well-curated high-quality websites—25 percent of the data in that top 2,000 has since been revoked. What this means is that the distribution of training data for models that respect robots.txt is rapidly shifting away from high-quality news, academic websites, forums, and social media to more organization and personal websites as well as e-commerce and blogs.
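    The gap between the overall figure and the top-2,000 figure comes from weighting by size. A rough Python sketch of that arithmetic follows, assuming each domain is described by a token count and a flag for whether its robots.txt now restricts crawling. The domain names and numbers are toy values invented for illustration, not figures from the report, and the report's actual methodology may differ.

```python
from typing import NamedTuple, Optional, Sequence


class Domain(NamedTuple):
    name: str
    tokens: int        # size of the domain's contribution to the data set
    restricted: bool   # True if the site now disallows crawling


def restricted_share(domains: Sequence[Domain], top_k: Optional[int] = None) -> float:
    """Token-weighted share of data that sits on restricted domains.

    If top_k is given, only the top_k largest domains are considered.
    """
    ranked = sorted(domains, key=lambda d: d.tokens, reverse=True)
    if top_k is not None:
        ranked = ranked[:top_k]
    total = sum(d.tokens for d in ranked)
    if total == 0:
        return 0.0
    blocked = sum(d.tokens for d in ranked if d.restricted)
    return blocked / total


# Toy data: three large sites (one restricted) and ten small unrestricted ones.
TOY = [
    Domain("big-news.example", 400, True),
    Domain("big-forum.example", 350, False),
    Domain("big-academic.example", 250, False),
] + [Domain(f"small-site-{i}.example", 100, False) for i in range(10)]

overall = restricted_share(TOY)
top_three = restricted_share(TOY, top_k=3)
print(f"overall: {overall:.0%}, top 3 by size: {top_three:.0%}")
```

    In this toy set the restricted share is larger among the biggest domains than across the whole collection, which mirrors the pattern Longpre describes for C4.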

    That seems like it could be a problem if we’re asking some future version of ChatGPT or Perplexity to answer complicated questions, and it’s taking the information from personal blogs and shopping sites.

    Longpre: Exactly. It’s difficult to measure how this will affect models, but we suspect there will be a gap between the performance of models that respect robots.txt and the performance of models that have already secured this data and are willing to train on it anyway.

    But the older data sets are still intact. Can AI companies just use the older data sets? What’s the downside of that?

    Longpre: Well, continuous data freshness really matters. It also isn’t clear whether robots.txt can apply retroactively. Publishers would likely argue it does. So it depends on your appetite for lawsuits or where you also think that trends might go, especially in the U.S., with the ongoing lawsuits surrounding fair use of data. The prime example is obviously The New York Times against OpenAI and Microsoft, but there are now many variants. There’s a lot of uncertainty as to which way it will go.

    The report is called “Consent in Crisis.” Why do you consider it a crisis?

    Longpre: I think that it’s a crisis for data creators, because of the difficulty in expressing what they want with existing protocols. And also for some developers that are non-commercial and maybe not even related to AI—academics and researchers are finding that this data is becoming harder to access. And I think it’s also a crisis because it’s such a mess. The infrastructure was not designed to accommodate all of these different use cases at once. And it’s finally becoming a problem because of these huge industries colliding, with generative AI against news creators and others.

    What can AI companies do if this continues, and more and more data is restricted? What would their moves be in order to keep training enormous models?

    Longpre: The large companies will license it directly. It might not be a bad outcome for some of the large companies if a lot of this data is foreclosed or difficult to collect, it just creates a larger capital requirement for entry. I think big companies will invest more into the data collection pipeline and into gaining continuous access to valuable data sources that are user-generated, like YouTube and GitHub and Reddit. Acquiring exclusive access to those sites is probably an intelligent market play, but a problematic one from an antitrust perspective. I’m particularly concerned about the exclusive data acquisition relationships that might come out of this.

    Do you think synthetic data can fill the gap?

    Longpre: Big companies are already using synthetic data in large quantities. There are both fears and opportunities with synthetic data. On one hand, there have been a series of works that have demonstrated the potential for model collapse, which is the degradation of a model due to training on poor synthetic data that may appear more often on the web as more and more generative bots are let loose. However, I think it’s unlikely that large models will be hampered much because they have quality filters, so the poor quality or repetitive stuff can be siphoned out. And the opportunities of synthetic data are when it’s created in a lab environment to be very high quality, and it’s targeting particularly domains that are underdeveloped.

    Do you give credence to the idea that we may be at peak data? Or do you feel like that’s an overblown concern?

    Longpre: There is a lot of untapped data out there. But interestingly, a lot of it is hidden behind PDFs, so you need to do OCR [optical character recognition]. A lot of data is locked away in governments, in proprietary channels, in unstructured formats, or difficult to extract formats like PDFs. I think there’ll be a lot more investment in figuring out how to extract that data. I do think that in terms of easily available data, many companies are starting to hit walls and turning to synthetic data.

    What’s the trend line here? Do you expect to see more websites putting up robots.txt restrictions in the coming years?

    Longpre: We expect the restrictions to rise, both in robots.txt and in terms of service. Those trend lines are very clear from our work, but they could be affected by external factors such as legislation, companies themselves changing their policies, the outcome of lawsuits, as well as community pressure from writers’ guilds and things like that. And I expect that the increased commoditization of data is going to cause more of a battlefield in this space.

    What would you like to see happen in terms of standardization within the industry to make it easier for websites to express preferences about crawling?

    Longpre: At the Data Provenance Initiative, we definitely hope that new standards will emerge and be adopted to allow creators to express their preferences in a more granular way around the uses of their data. That would make the burden much easier on them. I think that’s a no-brainer and a win-win. But it’s not clear whose job it is to create or enforce these standards. It would be amazing if the [AI] companies themselves could come to this conclusion and do it. But the designer of the standard will almost inevitably have some bias towards their own use, especially if it’s a corporate entity.

    It’s also the case that preferences shouldn’t be respected in all cases. For instance, I don’t think that academics or journalists doing prosocial research should necessarily be foreclosed from accessing data with machines that is already public, on websites that anyone could go visit themselves. Not all data is created equal and not all uses are created equal.

    Reference: https://ift.tt/o856RjP

Friday, August 30, 2024

Was an AI Image Generator Taken Down for Making Child Porn?




Why are AI companies valued in the millions and billions of dollars creating and distributing tools that can make AI-generated child sexual abuse material (CSAM)?

An image generator called Stable Diffusion version 1.5, which was created by the AI company Runway with funding from Stability AI, has been particularly implicated in the production of CSAM. And popular platforms such as Hugging Face and Civitai have been hosting that model and others that may have been trained on real images of child sexual abuse. In some cases, companies may even be breaking laws by hosting synthetic CSAM on their servers. And why are mainstream companies and investors like Google, Nvidia, Intel, Salesforce, and Andreessen Horowitz pumping hundreds of millions of dollars into these companies? Their support amounts to subsidizing content for pedophiles.

As AI safety experts, we’ve been asking these questions to call out these companies and pressure them to take the corrective actions we outline below. And we’re happy today to report one major triumph: seemingly in response to our questions, Stable Diffusion version 1.5 has been removed from Hugging Face. But there’s much still to do, and meaningful progress may require legislation.

The Scope of the CSAM Problem

Child safety advocates began ringing the alarm bell last year: Researchers at Stanford’s Internet Observatory and the technology non-profit Thorn published a troubling report in June 2023. They found that broadly available and “open-source” AI image-generation tools were already being misused by malicious actors to make child sexual abuse material. In some cases, bad actors were making their own custom versions of these models (a process known as fine-tuning) with real child sexual abuse material to generate bespoke images of specific victims.

Last October, a report from the U.K. nonprofit Internet Watch Foundation (which runs a hotline for reports of child sexual abuse material) detailed the ease with which malicious actors are now making photorealistic AI-generated child sexual abuse material, at scale. The researchers included a “snapshot” study of one dark web CSAM forum, analyzing more than 11,000 AI-generated images posted in a one-month period; of those, nearly 3,000 were judged severe enough to be classified as criminal. The report urged stronger regulatory oversight of generative AI models.

AI models can be used to create this material because they’ve seen examples before. Researchers at Stanford discovered last December that one of the most significant data sets used to train image-generation models included thousands of pieces of CSAM. Many of the most popular downloadable open-source AI image generators, including the popular Stable Diffusion version 1.5 model, were trained using this data. That version of Stable Diffusion was created by Runway, though Stability AI paid for the computing power to produce the dataset and train the model, and Stability AI released the subsequent versions.

Runway did not respond to a request for comment. A Stability AI spokesperson emphasized that the company did not release or maintain Stable Diffusion version 1.5, and says the company has “implemented robust safeguards” against CSAM in subsequent models, including the use of filtered data sets for training.

Also last December, researchers at the social media analytics firm Graphika found a proliferation of dozens of “undressing” services, many based on open-source AI image generators, likely including Stable Diffusion. These services allow users to upload clothed pictures of people and produce what experts term nonconsensual intimate imagery (NCII) of both minors and adults, also sometimes referred to as deepfake pornography. Such websites can be easily found through Google searches, and users can pay for the services using credit cards online. Many of these services only work on women and girls, and these types of tools have been used to target female celebrities like Taylor Swift and politicians like U.S. representative Alexandria Ocasio-Cortez.

AI-generated CSAM has real effects. The child safety ecosystem is already overtaxed, with millions of files of suspected CSAM reported to hotlines annually. Anything that adds to that torrent of content—especially photorealistic abuse material—makes it more difficult to find children that are actively in harm’s way. Making matters worse, some malicious actors are using existing CSAM to generate synthetic images of these survivors—a horrific re-violation of their rights. Others are using the readily available “nudifying” apps to create sexual content from benign imagery of real children, and then using that newly generated content in sexual extortion schemes.

One Victory Against AI-Generated CSAM

Based on the Stanford investigation from last December, it’s well-known in the AI community that Stable Diffusion 1.5 was trained on child sexual abuse material, as was every other model trained on the LAION-5B data set. These models are being actively misused by malicious actors to make AI-generated CSAM. And even when they’re used to generate more benign material, their use inherently revictimizes the children whose abuse images went into their training data. So we asked the popular AI hosting platforms Hugging Face and Civitai why they hosted Stable Diffusion 1.5 and derivative models, making them available for free download.

It’s worth noting that Jeff Allen, a data scientist at the Integrity Institute, found that Stable Diffusion 1.5 was downloaded from Hugging Face over 6 million times in the past month, making it the most popular AI image-generator on the platform.

When we asked Hugging Face why it has continued to host the model, company spokesperson Brigitte Tousignant did not directly answer the question, but instead stated that the company doesn’t tolerate CSAM on its platform, that it incorporates a variety of safety tools, and that it encourages the community to use the Safe Stable Diffusion model that identifies and suppresses inappropriate images.

Then, yesterday, we checked Hugging Face and found that Stable Diffusion 1.5 is no longer available. Tousignant told us that Hugging Face didn’t take it down, and suggested that we contact Runway—which we did, again, but we have not yet received a response.

It’s undoubtedly a success that this model is no longer available for download from Hugging Face. Unfortunately, it’s still available on Civitai, as are hundreds of derivative models. When we contacted Civitai, a spokesperson told us that they have no knowledge of what training data Stable Diffusion 1.5 used, and that they would only take it down if there was evidence of misuse.

Platforms should be getting nervous about their liability. This past week saw the arrest of Pavel Durov, CEO of the messaging app Telegram, as part of an investigation related to CSAM and other crimes.

What’s Being Done About AI-Generated CSAM

The steady drumbeat of disturbing reports and news about AI-generated CSAM and NCII hasn’t let up. While some companies are trying to improve their products’ safety with the help of the Tech Coalition, what progress have we seen on the broader issue?

In April, Thorn and All Tech Is Human announced an initiative to bring together mainstream tech companies, generative AI developers, model hosting platforms, and more to define and commit to Safety by Design principles, which put preventing child sexual abuse at the center of the product development process. Ten companies (including Amazon, Civitai, Google, Meta, Microsoft, OpenAI, and Stability AI) committed to these principles, and several others joined in to co-author a related paper with more detailed recommended mitigations. The principles call on companies to develop, deploy, and maintain AI models that proactively address child safety risks; to build systems to ensure that any abuse material that does get produced is reliably detected; and to limit the distribution of the underlying models and services that are used to make this abuse material.

These kinds of voluntary commitments are a start. Rebecca Portnoff, Thorn’s head of data science, says the initiative seeks accountability by requiring companies to issue reports about their progress on the mitigation steps. It’s also collaborating with standard-setting institutions such as IEEE and NIST to integrate their efforts into new and existing standards, opening the door to third party audits that would “move past the honor system,” Portnoff says. Portnoff also notes that Thorn is engaging with policy makers to help them conceive legislation that would be both technically feasible and impactful. Indeed, many experts say it’s time to move beyond voluntary commitments.

We believe that there is a reckless race to the bottom currently underway in the AI industry. Companies are so furiously fighting to be technically in the lead that many of them are ignoring the ethical and possibly even legal consequences of their products. While some governments—including the European Union—are making headway on regulating AI, they haven’t gone far enough. If, for example, laws made it illegal to provide AI systems that can produce CSAM, tech companies might take notice.

The reality is that while some companies will abide by voluntary commitments, many will not. And of those that do, many will take action too slowly, either because they’re not ready or because they’re struggling to keep their competitive advantage. In the meantime, malicious actors will gravitate to those services and wreak havoc. That outcome is unacceptable.

What Tech Companies Should Do About AI-Generated CSAM

Experts saw this problem coming from a mile away, and child safety advocates have recommended common-sense strategies to combat it. If we miss this opportunity to do something to fix the situation, we’ll all bear the responsibility. At a minimum, all companies, including those releasing open source models, should be legally required to follow the commitments laid out in Thorn’s Safety by Design principles:

  • Detect, remove, and report CSAM from their training data sets before training their generative AI models.
  • Incorporate robust watermarks and content provenance systems into their generative AI models so generated images can be linked to the models that created them, as would be required under a California bill that would create Digital Content Provenance Standards for companies that do business in the state. The bill will likely reach Governor Gavin Newsom’s desk in the coming month, and we hope he signs it.
  • Remove from their platforms any generative AI models that are known to be trained on CSAM or that are capable of producing CSAM. Refuse to rehost these models unless they’ve been fully reconstituted with the CSAM removed.
  • Identify models that have been intentionally fine-tuned on CSAM and permanently remove them from their platforms.
  • Remove “nudifying” apps from app stores, block search results for these tools and services, and work with payment providers to block payments to their makers.

There is no reason why generative AI needs to aid and abet the horrific abuse of children. But we will need all tools at hand—voluntary commitments, regulation, and public pressure—to change course and stop the race to the bottom.

The authors thank Rebecca Portnoff of Thorn, David Thiel of the Stanford Internet Observatory, Jeff Allen of the Integrity Institute, Ravit Dotan of TechBetter, and the tech policy researcher Owen Doyle for their help with this article.

Reference: https://ift.tt/OcbFt3a

Was an AI Image Generator Taken Down for Making Child Porn?




Why are AI companies valued in the millions and billions of dollars creating and distributing tools that can make AI-generated child sexual abuse material (CSAM)?

An image generator called Stable Diffusion version 1.5, which was created by the AI company Runway with funding from Stability AI, has been particularly implicated in the production of CSAM. And popular platforms such as Hugging Face and Civitai have been hosting that model and others that may have been trained on real images of child sexual abuse. In some cases, companies may even be breaking laws by hosting synthetic CSAM material on their servers. And why are mainstream companies and investors like Google, Nvidia, Intel, Salesforce, and Andreesen Horowitz pumping hundreds of millions of dollars into these companies? Their support amounts to subsidizing content for pedophiles.

As AI safety experts, we’ve been asking these questions to call out these companies and pressure them to take the corrective actions we outline below. And we’re happy today to report one major triumph: seemingly in response to our questions, Stable Diffusion version 1.5 has been removed from Hugging Face. But there’s much still to do, and meaningful progress may require legislation.

The Scope of the CSAM Problem

Child safety advocates began ringing the alarm bell last year: Researchers at Stanford’s Internet Observatory and the technology non-profit Thorn published a troubling report in June 2023. They found that broadly available and “open-source” AI image-generation tools were already being misused by malicious actors to make child sexual abuse material. In some cases, bad actors were making their own custom versions of these models (a process known as fine-tuning) with real child sexual abuse material to generate bespoke images of specific victims.

Last October, a report from the U.K. nonprofit Internet Watch Foundation (which runs a hotline for reports of child sexual abuse material) detailed the ease with which malicious actors are now making photorealistic AI-generated child sexual abuse material, at scale. The researchers included a “snapshot” study of one dark web CSAM forum, analyzing more than 11,000 AI-generated images posted in a one-month period; of those, nearly 3,000 were judged severe enough to be classified as criminal. The report urged stronger regulatory oversight of generative AI models.

AI models can be used to create this material because they’ve seen examples before. Researchers at Stanford discovered last December that one of the most significant data sets used to train image-generation models included thousands of pieces of CSAM. Many of the most popular downloadable open-source AI image generators, including the popular Stable Diffusion version 1.5 model, were trained using this data. That version of Stable Diffusion was created by Runway, though Stability AI paid for the computing power to produce the dataset and train the model, and Stability AI released the subsequent versions.

Runway did not respond to a request for comment. A Stability AI spokesperson emphasized that the company did not release or maintain Stable Diffusion version 1.5, and says the company has “implemented robust safeguards” against CSAM in subsequent models, including the use of filtered data sets for training.

Also last December, researchers at the social media analytics firm Graphika found a proliferation of dozens of “undressing” services, many based on open-source AI image generators, likely including Stable Diffusion. These services allow users to upload clothed pictures of people and produce what experts term nonconsensual intimate imagery (NCII) of both minors and adults, also sometimes referred to as deepfake pornography. Such websites can be easily found through Google searches, and users can pay for the services using credit cards online. Many of these services only work on women and girls, and these types of tools have been used to target female celebrities like Taylor Swift and politicians like U.S. representative Alexandria Ocasio-Cortez.

AI-generated CSAM has real effects. The child safety ecosystem is already overtaxed, with millions of files of suspected CSAM reported to hotlines annually. Anything that adds to that torrent of content—especially photorealistic abuse material—makes it more difficult to find children that are actively in harm’s way. Making matters worse, some malicious actors are using existing CSAM to generate synthetic images of these survivors—a horrific re-violation of their rights. Others are using the readily available “nudifying” apps to create sexual content from benign imagery of real children, and then using that newly generated content in sexual extortion schemes.

One Victory Against AI-Generated CSAM

Based on the Stanford investigation from last December, it’s well-known in the AI community that Stable Diffusion 1.5 was trained on child sexual abuse material, as was every other model trained on the LAION-5B data set. These models are being actively misused by malicious actors to make AI-generated CSAM. And even when they’re used to generate more benign material, their use inherently revictimizes the children whose abuse images went into their training data. So we asked the popular AI hosting platforms Hugging Face and Civitai why they hosted Stable Diffusion 1.5 and derivative models, making them available for free download?

It’s worth noting that Jeff Allen, a data scientist at the Integrity Institute, found that Stable Diffusion 1.5 was downloaded from Hugging Face over 6 million times in the past month, making it the most popular AI image-generator on the platform.

When we asked Hugging Face why it has continued to host the model, company spokesperson Brigitte Tousignant did not directly answer the question, but instead stated that the company doesn’t tolerate CSAM on its platform, that it incorporates a variety of safety tools, and that it encourages the community to use the Safe Stable Diffusion model that identifies and suppresses inappropriate images.

Then, yesterday, we checked Hugging Face and found that Stable Diffusion 1.5 is no longer available. Tousignant told us that Hugging Face didn’t take it down, and suggested that we contact Runway—which we did, again, but we have not yet received a response.

It’s undoubtedly a success that this model is no longer available for download from Hugging Face. Unfortunately, it’s still available on Civitai, as are hundreds of derivative models. When we contacted Civitai, a spokesperson told us that they have no knowledge of what training data Stable Diffusion 1.5 used, and that they would only take it down if there was evidence of misuse.

Platforms should be getting nervous about their liability. This past week saw the arrest of Pavel Durov, CEO of the messaging app Telegram, as part of an investigation related to CSAM and other crimes.

What’s Being Done About AI-Generated CSAM

The steady drumbeat of disturbing reports and news about AI-generated CSAM and NCII hasn’t let up. While some companies are trying to improve their products’ safety with the help of the Tech Coalition, what progress have we seen on the broader issue?

In April, Thorn and All Tech Is Human announced an initiative to bring together mainstream tech companies, generative AI developers, model hosting platforms, and more to define and commit to Safety by Design principles, which put preventing child sexual abuse at the center of the product development process. Ten companies (including Amazon, Civitai, Google, Meta, Microsoft, OpenAI, and Stability AI) committed to these principles, and several others joined in to co-author a related paper with more detailed recommended mitigations. The principles call on companies to develop, deploy, and maintain AI models that proactively address child safety risks; to build systems to ensure that any abuse material that does get produced is reliably detected; and to limit the distribution of the underlying models and services that are used to make this abuse material.

These kinds of voluntary commitments are a start. Rebecca Portnoff, Thorn’s head of data science, says the initiative seeks accountability by requiring companies to issue reports about their progress on the mitigation steps. It’s also collaborating with standard-setting institutions such as IEEE and NIST to integrate their efforts into new and existing standards, opening the door to third-party audits that would “move past the honor system,” Portnoff says. Portnoff also notes that Thorn is engaging with policy makers to help them conceive legislation that would be both technically feasible and impactful. Indeed, many experts say it’s time to move beyond voluntary commitments.

We believe that there is a reckless race to the bottom currently underway in the AI industry. Companies are so furiously fighting to be technically in the lead that many of them are ignoring the ethical and possibly even legal consequences of their products. While some governments—including the European Union—are making headway on regulating AI, they haven’t gone far enough. If, for example, laws made it illegal to provide AI systems that can produce CSAM, tech companies might take notice.

The reality is that while some companies will abide by voluntary commitments, many will not. And of those that do, many will take action too slowly, either because they’re not ready or because they’re struggling to keep their competitive advantage. In the meantime, malicious actors will gravitate to those services and wreak havoc. That outcome is unacceptable.

What Tech Companies Should Do About AI-Generated CSAM

Experts saw this problem coming from a mile away, and child safety advocates have recommended common-sense strategies to combat it. If we miss this opportunity to do something to fix the situation, we’ll all bear the responsibility. At a minimum, all companies, including those releasing open source models, should be legally required to follow the commitments laid out in Thorn’s Safety by Design principles:

  • Detect, remove, and report CSAM from their training data sets before training their generative AI models.
  • Incorporate robust watermarks and content provenance systems into their generative AI models so generated images can be linked to the models that created them, as would be required under a California bill that would create Digital Content Provenance Standards for companies that do business in the state. The bill will likely reach Governor Gavin Newsom for a hoped-for signature in the coming month.
  • Remove from their platforms any generative AI models that are known to be trained on CSAM or that are capable of producing CSAM. Refuse to rehost these models unless they’ve been fully reconstituted with the CSAM removed.
  • Identify models that have been intentionally fine-tuned on CSAM and permanently remove them from their platforms.
  • Remove “nudifying” apps from app stores, block search results for these tools and services, and work with payment providers to block payments to their makers.

There is no reason why generative AI needs to aid and abet the horrific abuse of children. But we will need all tools at hand—voluntary commitments, regulation, and public pressure—to change course and stop the race to the bottom.

The authors thank Rebecca Portnoff of Thorn, David Thiel of the Stanford Internet Observatory, Jeff Allen of the Integrity Institute, Ravit Dotan of TechBetter, and the tech policy researcher Owen Doyle for their help with this article.

Reference: https://ift.tt/LkOE57s

Video Friday: Robots Solving Table Tennis




Video Friday is your weekly selection of awesome robotics videos, collected by your friends at IEEE Spectrum robotics. We also post a weekly calendar of upcoming robotics events for the next few months. Please send us your events for inclusion.

ICRA@40: 23–26 September 2024, ROTTERDAM, NETHERLANDS
IROS 2024: 14–18 October 2024, ABU DHABI, UAE
ICSR 2024: 23–26 October 2024, ODENSE, DENMARK
Cybathlon 2024: 25–27 October 2024, ZURICH

Enjoy today’s videos!

Imbuing robots with “human-level performance” in anything is an enormous challenge, but it’s worth it when you see a robot with the skill to interact with a human on a (nearly) human level. Google DeepMind has managed to achieve amateur human-level competence at table tennis, which is much harder than it looks, even for humans. Pannag Sanketi, a tech-lead manager in the robotics team at DeepMind, shared some interesting insights about performing the research. But first, video!

Some behind the scenes detail from Pannag:

  • The robot had not seen any participants before. So we knew we had a cool agent, but we had no idea how it was going to fare in a full match with real humans. To witness it outmaneuver even some of the most advanced players was such a delightful moment for the team!
  • All the participants had a lot of fun playing against the robot, irrespective of who won the match. And all of them wanted to play more. Some of them said it will be great to have the robot as a playing partner. From the videos, you can even see how much fun the user study hosts sitting there (who are not authors on the paper) are having watching the games!
  • Barney, who is a professional coach, was an advisor on the project and our chief evaluator of the robot’s skills, assessing it the way he evaluates his students. He was also surprised by how the robot was always able to learn from the last few weeks’ sessions.
  • We invested a lot in remote and automated 24x7 operations. So not the setup in this video, but there are other cells that we can run 24x7 with a ball thrower.
  • We even tried robot-vs-robot, i.e. 2 robots playing against each other! :) The line between collaboration and competition becomes very interesting when they try to learn by playing with each other.

[ DeepMind ]

Thanks, Heni!

Yoink.

[ MIT ]

Considering how their stability and recovery are often tested, teaching robot dogs to be shy of humans is an excellent idea.

[ Deep Robotics ]

Yes, quadruped robots need tow truck hooks.

[ Paper ]

Earthworm-inspired robots require novel actuators, and Ayato Kanada at Kyushu University has come up with a neat one.

[ Paper ]

Thanks, Ayato!

Meet the AstroAnt! This miniaturized swarm robot can ride atop a lunar rover and collect data related to its health, including surface temperatures and damage from micrometeoroid impacts. In the summer of 2024, with support from our collaborator Castrol, the Media Lab’s Space Exploration Initiative tested AstroAnt in the Canary Islands, where the volcanic landscape resembles the lunar surface.

[ MIT ]

Kengoro has a new forearm that mimics the human radioulnar joint, giving it an even more natural badminton swing.

[ JSK Lab ]

Thanks, Kento!

Gromit’s concern that Wallace is becoming too dependent on his inventions proves justified, when Wallace invents a “smart” gnome that seems to develop a mind of its own. When it emerges that a vengeful figure from the past might be masterminding things, it falls to Gromit to battle sinister forces and save his master… or Wallace may never be able to invent again!

[ Wallace and Gromit ]

ASTORINO is a modern 6-axis robot based on 3D printing technology. Programmable in AS-language, it facilitates the preparation of classes with ready-made teaching materials, is easy both to use and to repair, and gives the opportunity to learn and make mistakes without fear of breaking it.

[ Kawasaki ]

Engineers at NASA’s Jet Propulsion Laboratory are testing a prototype of IceNode, a robot designed to access one of the most difficult-to-reach places on Earth. The team envisions a fleet of these autonomous robots deploying into unmapped underwater cavities beneath Antarctic ice shelves. There, they’d measure how fast the ice is melting — data that’s crucial to helping scientists accurately project how much global sea levels will rise.

[ IceNode ]

Los Alamos National Laboratory, in a consortium with four other National Laboratories, is leading the charge in finding the best practices to find orphaned wells. These abandoned wells can leak methane gas into the atmosphere and possibly leak liquid into the ground water.

[ LANL ]

Looks like Fourier has been working on something new, although this is still at the point of “looks like” rather than something real.

[ Fourier ]

Bio-Inspired Robot Hands: Altus Dexterity is a collaboration between researchers and professionals from Carnegie Mellon University, UPMC, the University of Illinois and the University of Houston.

[ Altus Dexterity ]

PiPER is a lightweight robotic arm with six integrated joint motors for smooth, precise control. Weighing just 4.2kg, it easily handles a 1.5kg payload and is made from durable yet lightweight materials for versatile use across various environments. Available for just $2,499 USD.

[ AgileX ]

At 104 years old, Lilabel has seen over a century of automotive transformation, from sharing a single car with her family in the 1920s to experiencing her first ride in a robotaxi.

[ Zoox ]

Traditionally, blind juggling robots use plates that are slightly concave to help them with ball control, but it’s also possible to make a blind juggler the hard way. Which, honestly, is much more impressive.

[ Jugglebot ]

Reference: https://ift.tt/U96j5GN

ChatGPT hits 200 million active weekly users, but how many will admit using it?


The OpenAI logo emerging from broken jail bars, on a purple background.

Credit: Benj Edwards / Getty Images

On Thursday, OpenAI said that ChatGPT has attracted over 200 million weekly active users, according to a report from Axios, doubling the AI assistant's user base since November 2023. The company also revealed that 92 percent of Fortune 500 companies are now using its products, highlighting the growing adoption of generative AI tools in the corporate world.

The rapid growth in user numbers for ChatGPT (which is not a new phenomenon for OpenAI) suggests growing interest in—and perhaps reliance on—the AI-powered tool, despite frequent skepticism from some critics of the tech industry.

"Generative AI is a product with no mass-market utility—at least on the scale of truly revolutionary movements like the original cloud computing and smartphone booms," PR consultant and vocal OpenAI critic Ed Zitron blogged in July. "And it’s one that costs an eye-watering amount to build and run."

Reference: https://ift.tt/DM3ApCv

Thursday, August 29, 2024

Commercial spyware vendor exploits used by Kremlin-backed hackers, Google says


Credit: Getty Images

Critics of spyware and exploit sellers have long warned that the advanced hacking tools sold by commercial surveillance vendors (CSVs) represent a worldwide danger because they inevitably find their way into the hands of malicious parties, even when the CSVs promise they will be used only to target known criminals. On Thursday, Google analysts presented evidence bolstering the critique after finding that spies working on behalf of the Kremlin used exploits that are “identical or strikingly similar” to those sold by spyware makers Intellexa and NSO Group.

The hacking outfit, tracked under names including APT29, Cozy Bear, and Midnight Blizzard, is widely assessed to work on behalf of Russia’s Foreign Intelligence Service, or the SVR. Researchers with Google’s Threat Analysis Group, which tracks nation-state hacking, said Thursday that they observed APT29 using exploits identical or nearly identical to those first used by commercial exploit sellers NSO Group of Israel and Intellexa of Ireland. In both cases, the commercial surveillance vendors’ exploits were first used as zero-days, meaning that the vulnerabilities weren’t publicly known and no patch was available at the time.

Identical or strikingly similar

Once patches became available for the vulnerabilities, TAG said, APT29 used the exploits in watering hole attacks, which infect targets by surreptitiously planting exploits on sites they’re known to frequent. TAG said APT29 used the exploits as n-days, which target vulnerabilities whose fixes have recently been released but not yet widely installed by users.

Reference: https://ift.tt/AjY6G2L

Celebrate IEEE Day’s 15th Anniversary on 1 October




IEEE Day commemorates the first time engineers worldwide gathered to share their technical ideas, in 1884. This year the annual event is scheduled for 1 October. Its theme is Leveraging Technology for a Better Tomorrow, emphasizing the positive impact tech can have.

IEEE Day, first celebrated in 2010, marks its 15th anniversary this year. Over the years, thousands of members have participated in events organized by IEEE sections, student branches, affinity groups, and society chapters. IEEE Day events provide a platform for engineers to share ideas and inspire one another.

For some sections, one day is not enough. Celebrations are scheduled between 29 September and 12 October, both virtually and in person, to connect members across borders.

“As we commemorate IEEE Day’s 15th anniversary, it is an opportune moment to reflect upon the remarkable influence that IEEE has had on each and every member, as well as the joyous events that have transpired annually across the globe,” says IEEE Member Cybele Ghanem, 2024 IEEE Day chair.

“This year holds the promise of an exceptional celebration, bringing together thousands of IEEE members in hundreds of events worldwide to honor the historical significance of IEEE,” Ghanem says. “I encourage everyone to seize this opportunity to review their IEEE journey, share their cherished moments with us, and embark on an even more exhilarating journey ahead.”

A small group of people sitting in chairs on a small stage talking into microphones. One of several panel discussions organized by the IEEE Hyderabad Section to mark IEEE Day 2023. Bindu Madhavi

Global collaboration

Past events have included humanitarian projects, lectures on cutting-edge technical topics, sessions on career development and résumé writing, networking events, and an IEEE flash mob.

The events are an excellent way to engage IEEE members, recruit new ones, provide volunteering opportunities, and showcase the organization, Ghanem says. Through workshops, seminars, and networking sessions, IEEE Day encourages knowledge exchange and camaraderie.

Activities and contests

Participants can engage in competitions and win prizes.

The IEEE Day photo and video contests allow attendees to visually document what took place at their events, then share the images with the world. There are three photography categories: humanitarian, STEM, and technical. Videos may be long-form or short.

Contest winners receive monetary rewards and get a chance to be showcased in IEEE online and print publications as well as on social media platforms. So, take along your phone or camera when attending an IEEE Day event to capture the spirit of innovation and collaboration.

Join the celebration

IEEE will be offering a special discount on membership for those joining during the IEEE Day period. Many IEEE societies are planning special offers as well.

Resources and more information can be found on the IEEE Day website.

Reference: https://ift.tt/H9EVFDf

Escape Proprietary Smart Home Tech With This DIY Panel




Over the last few years, I’ve added a fair amount of smart-home technology to my house. Among other things, I can control lights and outlets, monitor the status of various appliances, measure how much electricity and water I’m using, and even cut off the water supply in the event of a leak. All this technology is coordinated through a hub, which I originally accessed through a conventional browser-based interface. But scrolling and clicking through screens to find the reading or setting I want is a) slow and b) boring. I wanted an interface that was fast and fun—a physical control panel with displays and buttons.

Something like the control room in the nuclear power plant in 1979’s The China Syndrome. I was about 10 years old when I saw that movie, and my overwhelming thought while watching Jack Lemmon trying to avert a meltdown was, “Boy, those panels look neat!” So they became my North Star for this design.

Before I could work on the aesthetic elements, however, I had to consider how my panel was going to process inputs and outputs and communicate with the systems in my home. The devices in my home are tied together using the open source Home Assistant platform. Using an open source platform means I don’t have to worry that, for example, I suddenly won’t be able to turn on my lights due to a forced upgrade of a proprietary system, or wonder if someone in the cloud is monitoring the activity in my home.

The heart of my Home Assistant setup is a hub powered by an old PC running Linux. This handles wireless connections with my sensors, appliances, and other devices. For commercial off-the-shelf equipment—like my energy meter—this communication is typically via Z-Wave. My homebrew devices are connected to the GPIO pins of a Raspberry Pi, which relays their state via Wi-Fi using the MQTT standard protocol for the Internet of Things. However, I decided on a wired Ethernet connection between the control panel and my hub PC, as this would let me use Power over Ethernet (PoE) to supply electricity to the panel.

A variety of electronic components such as individual LEDs and seven-segment displays, buttons, and switches. The different types of components used in the control panel include a touchscreen display [A], LED displays [B], Raspberry Pis [C], Power over Ethernet boards [D], and an emergency stop button [E]. James Provost

In fact, I use two Ethernet connections, because I decided to divide the functionality of the control panel across two model 3B+ Raspberry Pis, which cost about US $35 each (a complete bill of materials can be found on my GitHub repository). One Pi drives a touchscreen display, while the other handles the buttons and LEDs. Each is fitted with a $20 add-on PoE “hat” to draw power from its Ethernet connection.

Driving all the buttons and LEDs requires over 50 I/O signals, more than can be accommodated by the GPIO header found on a Pi. Although this header has 40 pins, only about 26 are usable in practice. So I used three $6 I2C expanders, each capable of handling 16 I/O signals and relaying them back via a two-wire data bus.
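The pin budget described above can be checked with a little arithmetic. What follows is a minimal sketch, not the project's actual code: the I2C addresses 0x20 through 0x22 and the function names are assumptions made for illustration, since the article names neither the expander chip nor its addresses. It maps a logical signal number to an expander and a pin on that expander.

```python
# Hypothetical 7-bit I2C addresses; the article does not say which addresses
# (or which expander chip) the panel actually uses.
EXPANDER_ADDRESSES = (0x20, 0x21, 0x22)
PINS_PER_EXPANDER = 16   # each expander handles 16 I/O signals (from the article)
USABLE_PI_GPIO = 26      # "about 26" usable header pins (from the article)


def expander_capacity():
    """Total number of I/O signals available across all expanders."""
    return len(EXPANDER_ADDRESSES) * PINS_PER_EXPANDER


def locate_signal(index):
    """Map a logical signal number to (expander address, pin on that expander)."""
    if not 0 <= index < expander_capacity():
        raise ValueError(f"signal {index} does not fit on the expanders")
    chip, pin = divmod(index, PINS_PER_EXPANDER)
    return EXPANDER_ADDRESSES[chip], pin


if __name__ == "__main__":
    print("expander signals:", expander_capacity())
    print("with the Pi's own pins:", expander_capacity() + USABLE_PI_GPIO)
    address, pin = locate_signal(37)
    print(f"signal 37 -> expander 0x{address:02x}, pin {pin}")
```

Three 16-channel expanders supply 48 signals on their own, so a panel that needs more than 50 would presumably route the remainder to the Pi's own usable pins. The article does not spell out that split.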

The software that drives each Pi also has its functionality separated out. This is done using Docker containers: software environments that act as self-contained sandboxes. The Pi responsible for the touchscreen has three containers: One runs a browser in kiosk mode, which fetches a graphical display from the Home Assistant hub. A second container runs a Python script, which translates touchscreen inputs—such as pressing an icon for another information screen—into requests to the hub. A third container runs a local Web server: When the kiosk browser is pointed to this local server instead of the hub, the screen displays internal diagnostic information that is useful for troubleshooting.

The other Pi has two containers running Python scripts. One handles all the button inputs and sends commands to the hub. The other requests status information from the hub and updates all the LEDs accordingly.
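The core logic of two such scripts might look like the sketch below. It is an illustration under stated assumptions rather than the author's implementation: the entity names are hypothetical, the button and LED states are assumed to be packed into 16-bit words (one per expander), and the network calls to the hub are left out so that only the bit handling is shown.

```python
def newly_pressed(previous, current):
    """Return the bit positions that changed from 0 to 1 between two
    16-bit button snapshots (i.e., buttons pressed since the last poll)."""
    rising = current & ~previous
    return [bit for bit in range(16) if rising & (1 << bit)]


def led_mask(states, led_entities):
    """Build a 16-bit LED pattern: bit i is set when the hub reports
    led_entities[i] as "on". Unknown entities are treated as off."""
    mask = 0
    for bit, entity in enumerate(led_entities):
        if states.get(entity) == "on":
            mask |= 1 << bit
    return mask


if __name__ == "__main__":
    # Hypothetical entity IDs, used only for illustration.
    leds = ["switch.water_main", "light.hallway", "switch.outlet_1"]
    hub_states = {"switch.water_main": "on", "light.hallway": "off",
                  "switch.outlet_1": "on"}
    print(bin(led_mask(hub_states, leds)))
    print(newly_pressed(0b0101, 0b0111))
```

Keeping these as pure functions means the bit handling can be tested without a hub or any hardware attached, which fits the one-subsystem-at-a-time approach that containers encourage.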

The first Raspberry Pi has containers labeled “touch screen commands,” “diagnostic web server,” and “Kiosk Web Browser.” The second Raspberry Pi has containers labeled “Button Script” and “LED script.” Input and output functions are split across software containers running on the panel’s Raspberry Pis. These communicate with a hub to send commands and get status updates. James Provost

These containers run on top of BalenaOS, an operating system that’s designed for running these sandboxes on edge as well as embedded devices like the Pi. Full disclosure: I’m the edge AI enablement lead for Balena, the company responsible for BalenaOS, but I started using the operating system before I joined the company because of its container-based approach. You can run Docker containers using the Raspberry Pi OS, but BalenaOS makes it easier to manage containers, including starting, stopping, and updating them remotely.

You might think that this software infrastructure is overkill for simply reading the state of some buttons and illuminating some lights, but I like containers because they let me work on one subsystem without worrying about how it will affect the rest of the system: I can tinker with how button presses are sent to the hub without messing up the touchscreen.

The buttons and various displays are mounted in a set of 3D-printed panels. I first mapped these out, full size, on paper, and then created the 3D print files in TinkerCAD. The labels for each control, as well as a schematic of my home’s water pipes, were printed as indentations in each segment, and then I filled them with white spackle for contrast. I then mounted the array of panels in an off-the-shelf $45 “floater” frame.

By a small miracle of the maker spirits, the panel segments and the frame all fit together nicely on the first try. I mounted the finished panel in a hallway of my home, somewhat to the bemusement of my family. But I don’t mind: If I ever have a water leak, I’ll get to press the big emergency button to shut off the main valve with all the aplomb of Jack Lemmon trying to stop a nuclear meltdown!

Reference: https://ift.tt/GHxf1Tb

Robot Metalsmiths Are Resurrecting Toroidal Tanks for NASA




In the 1960s and 1970s, NASA spent a lot of time thinking about whether toroidal (donut-shaped) fuel tanks were the way to go with its spacecraft. Toroidal tanks have a bunch of potential advantages over conventional spherical fuel tanks. For example, you can fit nearly 40% more volume within a toroidal tank than if you were using multiple spherical tanks within the same space. And perhaps most interestingly, you can shove stuff (like the back of an engine) through the middle of a toroidal tank, which could lead to some substantial efficiency gains if the tanks could also handle structural loads.

Because of their relatively complex shape, toroidal tanks are much more difficult to make than spherical tanks. Even though these tanks can perform better, NASA simply doesn’t have the expertise to manufacture them anymore, since each one has to be hand-built by highly skilled humans. But a company called Machina Labs thinks that they can do this with robots instead. And their vision is to completely change how we make things out of metal.


The fundamental problem that Machina Labs is trying to solve is that if you want to build parts out of metal efficiently at scale, it’s a slow process. Large metal parts need their own custom dies, which are very expensive one-offs that are about as inflexible as it’s possible to get, and then entire factories are built around these parts. It’s a huge investment, which means that it doesn’t matter if you find some new geometry or technique or material or market, because you have to justify that enormous up-front cost by making as much of the original thing as you possibly can, stifling the potential for rapid and flexible innovation.

On the other end of the spectrum you have the also very slow and expensive process of making metal parts one at a time by hand. A few hundred years ago, this was the only way of making metal parts: skilled metalworkers using hand tools for months to make things like armor and weapons. The nice thing about an expert metalworker is that they can use their skills and experience to make anything at all, which is where Machina Labs’ vision comes from, explains CEO Edward Mehr, who co-founded Machina Labs after spending time at SpaceX followed by leading the 3D printing team at Relativity Space.

“Craftsmen can pick up different tools and apply them creatively to metal to do all kinds of different things. One day they can pick up a hammer and form a shield out of a sheet of metal,” says Mehr. “Next, they pick up the same hammer, and create a sword out of a metal rod. They’re very flexible.”

The technique that a human metalworker uses to shape metal is called forging, which preserves the grain flow of the metal as it’s worked. Parts made by casting, stamping, or milling (which are all ways of automating metal part production) are simply not as strong or as durable as forged parts, which can be an important differentiator for (say) things that have to go into space. But more on that in a bit.

The problem with human metalworkers is that the throughput is bad—humans are slow, and highly skilled humans in particular don’t scale well. For Mehr and Machina Labs, this is where the robots come in.

“We want to automate and scale using a platform called the ‘robotic craftsman.’ Our core enablers are robots that give us the kinematics of a human craftsman, and artificial intelligence that gives us control over the process,” Mehr says. “The concept is that we can do any process that a human craftsman can do, and actually some that humans can’t do because we can apply more force with better accuracy.”

This flexibility that robot metalworkers offer also enables the crafting of bespoke parts that would be impractical to make in any other way. These include toroidal (donut-shaped) fuel tanks that NASA has had its eye on for the last half century or so.

Two people stand in a warehouse with a huge silver donut-shaped tank in front of them. Machina Labs’ CEO Edward Mehr (on right) stands behind a 15-foot toroidal fuel tank. Machina Labs

“The main challenge of these tanks is that the geometry is complex,” Mehr says. “Sixty years ago, NASA was bump-forming them with very skilled craftspeople, but a lot of them aren’t around anymore.” Mehr explains that the only other way to get that geometry is with dies, but for NASA, getting a die made for a fuel tank that’s necessarily been customized for one single spacecraft would be pretty much impossible to justify. “So one of the main reasons we’re not using toroidal tanks is because it’s just hard to make them.”

Machina Labs is now making toroidal tanks for NASA. For the moment, the robots are just doing the shaping, which is the tough part. Humans then weld the pieces together. But there’s no reason why the robots couldn’t do the entire process end-to-end and even more efficiently. Currently, they’re doing it the “human” way based on existing plans from NASA. “In the future,” Mehr tells us, “we can actually form these tanks in one or two pieces. That’s the next area that we’re exploring with NASA—how can we do things differently now that we don’t need to design around human ergonomics?”

Machina Labs’ ‘robotic craftsmen’ work in pairs to shape sheet metal, with one robot on each side of the sheet. The robots align their tools slightly offset from each other with the metal between them such that as the robots move across the sheet, it bends between the tools. Machina Labs

The video above shows Machina’s robots working on a tank that’s about 4.6 meters (15 feet) in diameter, likely destined for the Moon. “The main application is for lunar landers,” says Mehr. “The toroidal tanks bring the center of gravity of the vehicle lower than what you would have with spherical or pill-shaped tanks.”

Training these robots to work metal like this is done primarily through physics-based simulations that Machina developed in house (existing software being too slow), followed by human-guided iterations based on the resulting real-world data. The way that metal moves under pressure can be simulated pretty well, and although there’s certainly still a sim-to-real gap (simulating how the robot’s tool adheres to the surface of the material is particularly tricky), the robots are collecting so much empirical data that Machina is making substantial progress towards full autonomy, and even finding ways to improve the process.

A hand holds a silvery piece of sheet metal that has been forged into a series of symmetrical waves. An example of the kind of complex metal parts that Machina’s robots are able to make. Machina Labs

Ultimately, Machina wants to use robots to produce all kinds of metal parts. On the commercial side, they’re exploring things like car body panels, offering the option to change how your car looks in geometry rather than just color. The requirement for a couple of beefy robots to make this work means that roboforming is unlikely to become as pervasive as 3D printing, but the broader concept is the same: making physical objects a software problem rather than a hardware problem to enable customization at scale.

Reference: https://ift.tt/2toEOns

Wednesday, August 28, 2024

Unpatchable 0-day in surveillance cam is being exploited to install Mirai


The word ZERO-DAY is hidden amidst a screen filled with ones and zeroes.

Credit: Getty Images

Malicious hackers are exploiting a critical vulnerability in a widely used security camera to spread Mirai, a family of malware that wrangles infected Internet of Things devices into large networks for use in attacks that take down websites and other Internet-connected devices.

The attacks target the AVM1203, a surveillance device from Taiwan-based manufacturer AVTECH, network security provider Akamai said Wednesday. Unknown attackers have been exploiting a 5-year-old vulnerability since March. The zero-day vulnerability, tracked as CVE-2024-7029, is easy to exploit and allows attackers to execute malicious code. The AVM1203 is no longer sold or supported, so no update is available to fix the critical zero-day.

That time a ragtag army shook the Internet

Akamai said that the attackers are exploiting the vulnerability so they can install a variant of Mirai, which arrived in September 2016 when a botnet of infected devices took down cybersecurity news site Krebs on Security. Mirai contained functionality that allowed a ragtag army of compromised webcams, routers, and other types of IoT devices to wage distributed denial-of-service attacks of record-setting sizes. In the weeks that followed, the Mirai botnet delivered similar attacks on Internet service providers and other targets. One such attack, against dynamic domain name provider Dyn, paralyzed vast swaths of the Internet.

Reference: https://ift.tt/2MUmgHK

Backdoor infecting VPNs used “magic packets” for stealth and security

When threat actors use backdoor malware to gain access to a network, they want to make sure all their hard work can’t be leveraged by comp...