The Privacy Across Borders (PAB) team has been exploring how adversaries can use artificial intelligence (AI) to undermine privacy and national security. PAB student research assistant (RA) Natalia Baigorri examined the national security threats posed by adversarial use of AI. In her post, she highlighted how the concerns about Chinese ownership of TikTok, as accepted by the Supreme Court in TikTok v. Garland, could apply to DeepSeek, which “risks providing the Chinese government the ‘means to undermine U.S. national security’ through ‘data collection and covert content manipulation.’” 

PAB Student RA Alexandra Mansfield further analyzed the applicability of TikTok v. Garland to DeepSeek, focusing on the First Amendment implications. She found:

“The Court also recognized the value of editorial discretion, protecting the judgment involved in shaping, selecting, or generating content. An LLM’s generative process involves expressive functions: selecting data, structuring outputs, and filtering responses. For Americans using a domestically operated LLM, those expressive dimensions shape the flow of information, making the platform analogous to a communications forum rather than a neutral tool. Therefore, even if the model itself is not a “speaker” with constitutional rights, its domestic operation involves expressive decisions that implicate the public’s informational rights.”

In this post, we take a closer look at what companies actually do to shape the ways in which LLMs respond to user prompts, and in subsequent posts, we will consider further the national security and digital governance implications.

Companies rely on multiple safeguards to guide AI model behavior; these may be applied before training, during model development, at deployment, or through external moderation systems that monitor outputs. Content moderation is the monitoring of AI-generated content and its regulation to conform to company-specific guidelines, often in service of users’ safety and well-being. A few of these safeguards include:

  1. Pre-Training Safeguards

Pre-training safeguards operate on training data, the information machine learning models use to learn patterns, make predictions, and ultimately generate content. At this development stage, companies can exclude harmful or biased sources before a model ever learns from them. After processing vast amounts of training data, an algorithm or AI model is considered trained and ready for deployment; without training data, algorithms would achieve little to nothing.

Data filtering is then used to refine these datasets by retaining information that meets specific conditions set by the company. This process improves the quality of an AI model’s output by removing irrelevant data and noise. One example of data filtering that can result in direct content moderation is the implementation of text filters, which exclude data containing certain words or phrases.
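To make this concrete, here is a minimal sketch of a keyword-based text filter applied to a toy training corpus. The blocked terms, documents, and function names are illustrative placeholders, not drawn from any particular company’s pipeline.

```python
# Minimal sketch of pre-training text filtering on a toy corpus.
import re

# Hypothetical blocklist of phrases a company might exclude from training data.
BLOCKED_TERMS = re.compile(r"\b(credit card number|social security number)\b", re.IGNORECASE)

def passes_text_filter(document: str) -> bool:
    """Return True if the document contains none of the blocked phrases."""
    return BLOCKED_TERMS.search(document) is None

# Example corpus; a real pre-training corpus would contain billions of documents.
raw_corpus = [
    "A recipe for sourdough bread.",
    "Please send me your credit card number.",
]

filtered_corpus = [doc for doc in raw_corpus if passes_text_filter(doc)]
print(filtered_corpus)  # only the first document survives the filter
```

In practice, text filters like this are combined with classifiers, deduplication, and quality scoring, but the basic idea is the same: documents that fail the company’s conditions never reach the model.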

  2. Training Safeguards

Safeguards that help AI models train effectively also serve as valuable tools for content regulation. One such safeguard is fine-tuning with reinforcement learning from human feedback (RLHF), which occurs after initial training. In this stage, human evaluators provide feedback that rewards compliant responses and discourages less desirable content. Another approach is adversarial training, which exploits the mathematical nature of machine learning models by identifying weaknesses in their decision boundaries. Through iterative probing, an attacker (or a company’s own red team) can discover minimal changes to input data that cause a model to produce incorrect outputs; by folding those adversarial examples back into training, developers patch the blind spots they reveal and guide models toward more reliable and aligned behavior.
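As a rough illustration of the adversarial probing step, the toy sketch below uses a two-feature linear classifier; the weights, input, and perturbation size are invented for illustration and stand in for the far larger models and inputs involved in real adversarial training.

```python
# Toy sketch of adversarial probing and adversarial training (NumPy only).
import numpy as np

w = np.array([1.0, -1.0])            # weights of a toy linear classifier

def predict(x: np.ndarray) -> int:
    """Decision rule: class 1 if the score w . x is positive, else class 0."""
    return int(x @ w > 0)

x = np.array([0.1, 0.0])             # an input the model classifies as class 1
epsilon = 0.2                        # perturbation budget (illustrative)

# Probing: for a linear model the gradient of the score with respect to x is w,
# so stepping against sign(w) is the small change most likely to flip the decision.
x_adv = x - epsilon * np.sign(w)

print(predict(x), predict(x_adv))    # 1 0 -> a tiny perturbation flips the output

# Adversarial training: add the perturbed example, with its correct label,
# back into the training set so the next round of training covers this blind spot.
training_set = [(x, 1), (x_adv, 1)]
```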

  3. Grounding Techniques

Grounding is the ability to connect model output to verifiable sources of information, improving accuracy and reliability. One grounding technique is Retrieval-Augmented Generation (RAG), an AI framework that combines the strengths of traditional information retrieval systems, like search engines and databases, with the capabilities of generative large language models (LLMs). By integrating external data and world knowledge with LLM capabilities, grounded generation becomes more accurate and relevant to a company’s specific needs.
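The sketch below shows the general shape of the RAG pattern, assuming a toy in-memory corpus and naive keyword-overlap retrieval in place of a real vector database; the final call to the LLM is left as a hypothetical `call_llm` step rather than any vendor’s API.

```python
# Minimal RAG sketch: retrieve relevant passages, then ground the prompt in them.
CORPUS = {
    "policy": "Refunds are available within 30 days of purchase.",
    "shipping": "Orders ship within 2 business days.",
}

def retrieve(query: str, k: int = 1) -> list[str]:
    """Naive keyword-overlap retrieval; production systems use vector search."""
    scored = sorted(
        CORPUS.values(),
        key=lambda doc: len(set(query.lower().split()) & set(doc.lower().split())),
        reverse=True,
    )
    return scored[:k]

def build_grounded_prompt(query: str) -> str:
    """Prepend retrieved context so the model answers from verifiable sources."""
    context = "\n".join(retrieve(query))
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"

print(build_grounded_prompt("When do orders ship?"))
# The grounded prompt would then be passed to the model, e.g. call_llm(prompt) (hypothetical).
```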

  4. Deployment Safeguards

Deployment safeguards provide an additional layer of protection by moderating content at the point of user interaction. While safety measures like reinforcement learning from human feedback, supervised fine-tuning, and training data filtering occur during model development, input and output filtering operates at deployment, when a user submits a prompt. This approach relies on classifiers to evaluate both user input and model responses: input monitoring detects attempts to misuse the system, while output monitoring intercepts unsafe content before it is displayed. Classifiers are machine learning models trained to categorize content, such as distinguishing credible information from misinformation. Several tools exist to detect harmful material, including Google’s Text Moderation Service and Perspective API, which score text across safety and toxicity categories. Frontier model developers also maintain custom classifier systems to help ensure safer model behavior; examples include Microsoft’s harm-category filters and Anthropic’s constitutional classifiers trained on synthetic prompts.
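The sketch below shows the general shape of deployment-time input and output filtering. The `toxicity_score` classifier, the threshold, and the `generate` function are placeholders invented for illustration, not any vendor’s actual moderation stack.

```python
# Sketch of deployment-time input/output filtering wrapped around a model call.
UNSAFE_THRESHOLD = 0.8  # illustrative cutoff

def toxicity_score(text: str) -> float:
    """Placeholder classifier; a production system would call a trained model or API."""
    flagged_terms = {"attack", "bomb"}
    return 1.0 if any(term in text.lower() for term in flagged_terms) else 0.0

def generate(prompt: str) -> str:
    """Placeholder for the underlying LLM call."""
    return f"Model response to: {prompt}"

def moderated_chat(prompt: str) -> str:
    # Input monitoring: block prompts that look like misuse attempts.
    if toxicity_score(prompt) >= UNSAFE_THRESHOLD:
        return "Your request was blocked by the input filter."
    response = generate(prompt)
    # Output monitoring: intercept unsafe content before it reaches the user.
    if toxicity_score(response) >= UNSAFE_THRESHOLD:
        return "The response was withheld by the output filter."
    return response

print(moderated_chat("How do I bake bread?"))
```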

  5. External Moderation Tools

External moderation tools further support content regulation through AI-powered content moderation APIs. These systems use machine learning models trained on large datasets of labeled content to identify patterns associated with harmful or inappropriate material. Once they are trained, they can autonomously analyze new content and make moderation decisions based on predefined rules and guidelines. Integrated into social media platforms, websites, and other online services, these tools enable automated detection and removal of content such as hate speech, harassment, and explicit material.
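A rough sketch of how such an API might be wired into a platform’s posting flow appears below; the endpoint URL, request schema, category names, and score fields are hypothetical stand-ins for a commercial moderation service.

```python
# Sketch of integrating an external moderation API into a posting workflow.
import requests

MODERATION_URL = "https://moderation.example.com/v1/classify"  # hypothetical endpoint

def is_allowed(post_text: str) -> bool:
    """Ask the (hypothetical) moderation API to score the post across harm categories."""
    resp = requests.post(
        MODERATION_URL,
        json={"text": post_text, "categories": ["hate_speech", "harassment", "explicit"]},
        timeout=5,
    )
    resp.raise_for_status()
    scores = resp.json()["scores"]  # e.g. {"hate_speech": 0.02, "harassment": 0.01, ...}
    return all(score < 0.5 for score in scores.values())

def handle_new_post(post_text: str) -> str:
    # Automated detection and removal: reject posts the API flags as harmful.
    return "published" if is_allowed(post_text) else "removed"
```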
