
AI-Based Content Moderation: Enhancing Trust & Safety on Online Platforms

As a professional working for an online platform, you understand that harmful content and bad behavior significantly affect the company's bottom line. Users leave when they feel harassed, while those who feel welcomed stay and spend.

Content moderation is essential to ensure user behavior adheres to your community standards. Your moderation team is essential, but they can't keep up with the speed at which users create and post content, especially as your user base expands. Artificial intelligence content moderation can scale the detection and actioning of content and behaviors that violate your community guidelines. It can also be deployed to identify and elevate healthy and positive behaviors, helping set a tone for the entire community.



Using AI for Content Moderation

Content moderation leverages a subcategory of AI known as machine learning (not all AI uses machine learning, but all machine learning is AI).

Machine learning is an approach to AI in which the computer learns from data.

Two broad categories of content moderation machine learning algorithms are supervised machine learning and unsupervised machine learning. Supervised machine learning requires labeled data to answer a specific question, whereas unsupervised machine learning uncovers patterns that we would not otherwise notice.


Content moderation using AI leverages supervised machine learning. Data scientists use massive datasets to teach algorithms to identify and surface specific behaviors, which can be positive or negative. For instance, Spectrum Labs has machine-learning models that identify teaching and mentoring behaviors as well as harassment and bullying behaviors.

In content moderation, machine learning is deployed to turn raw data into detection tools or make predictions about trends. For example, we can teach a machine to identify and filter out spam emails by providing numerous examples of what spam and legitimate emails look like. The machine quickly learns to recognize the words, phrases, and formatting used in spam emails and filter them out so they never arrive in the user's inbox. The more data the AI is fed, the quicker and better it will learn.
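The spam example can be sketched in a few lines of Python. This is a toy naive Bayes classifier with add-one smoothing, written only to illustrate learning from labeled examples; the function names, the four training messages, and the "spam"/"ham" labels are all assumptions made for illustration, and a production system would use far more data and a far richer model.

```python
import math
from collections import Counter

def train(examples):
    """Count words per label from (text, label) pairs."""
    word_counts = {}
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.lower().split())
    return word_counts, label_counts

def predict(model, text):
    """Return the label with the highest naive Bayes log-probability."""
    word_counts, label_counts = model
    vocab = {w for counts in word_counts.values() for w in counts}
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, counts in word_counts.items():
        # Prior: share of training examples carrying this label.
        score = math.log(label_counts[label] / total_docs)
        # Add-one smoothing keeps unseen words from zeroing the probability.
        denom = sum(counts.values()) + len(vocab)
        for word in text.lower().split():
            score += math.log((counts[word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical labeled examples (illustration only).
TOY_DATA = [
    ("win a free prize now", "spam"),
    ("free money click now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday with the team", "ham"),
]
model = train(TOY_DATA)
```

With this toy data, `predict(model, "free prize")` leans toward "spam" and `predict(model, "monday meeting")` toward "ham", because those words appear only in the corresponding training examples.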

A well-designed AI system can work quicker and more accurately than a human, which means that jobs that are time-consuming for a person can be taught to AI. Businesses can save time and money by using artificial intelligence to help with content moderation tasks.

But a challenge confounds even the best-trained machine-learning algorithms: the complexity of human behavior. In some instances, specific words and phrases are used to harass or even menace another user, but in other cases, those same words and phrases are used in a relatively benign way. The difference is context.

What Are Different Types of Content Moderation Tools?

There are multiple tools that Trust & Safety teams can deploy for content moderation on their platforms. Broadly speaking, these tools fit into one of three buckets: word filters and RegEx solutions, classifiers, and contextual AI. Robust moderation tools, such as Spectrum Labs Guardian AI, will use all three.

Let’s take a look at each.

Word Filters, Keyword & Regular Expression (“RegEx”) Based Solutions

Word filter, keyword, and regular expression-based solutions (aka keyword/RegEx) are widely deployed content moderation tools. These tools look for toxic behavior based on specific words, such as racial slurs and popular names for illegal drugs, as well as regular expressions that match patterns such as “I hate [a protected class].”

These solutions are helpful for catching egregious behavior, such as common hateful terms directed at a person of a specific ethnicity, but there are shortfalls to relying on such solutions as your main line of defense. 

To begin, they require your Trust & Safety team to manage lists of words and phrases on a continuous basis. And yet even a well-maintained list presents challenges. The language of the Internet is constantly changing. Specific words and phrases may be perfectly acceptable in certain contexts, and automatically blocking them can leave users feeling censored. And if your platform is used by speakers of a language that isn’t native to your Trust & Safety team, it is likely that their lists will be inadequate for the task at hand.

In some instances, the keyword is too broad; if you set it to block the word “ass,” it may very well catch and block perfectly innocuous words like class, bass or sass.
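The over-blocking problem is easy to reproduce with Python's standard `re` module. In this sketch, the one-word blocklist and both function names are hypothetical: a naive substring check flags "class", while a pattern anchored at word boundaries (`\b`) does not.

```python
import re

BLOCKLIST = ["ass"]  # hypothetical one-word blocklist for illustration

def naive_match(text: str) -> bool:
    """Substring check: flags text containing a blocked term anywhere."""
    lowered = text.lower()
    return any(term in lowered for term in BLOCKLIST)

# \b anchors each term at word boundaries, so "class" no longer matches.
BOUNDED = re.compile(
    r"\b(?:" + "|".join(map(re.escape, BLOCKLIST)) + r")\b",
    re.IGNORECASE,
)

def bounded_match(text: str) -> bool:
    """Whole-word check: flags only standalone occurrences of a blocked term."""
    return BOUNDED.search(text) is not None
```

Word boundaries reduce this kind of false positive, but they do nothing for threats that contain no listed keyword at all.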

But the gravest challenge to a word/expression-based solution is how it can leave community users vulnerable. Harassment that results in real harm can occur without ever using a single keyword on your list. The phrase, “take a different bus home today,” is clearly a threatening phrase, but won’t be picked up by keywords or regular expressions. 

A keyword filter also looks at a single post and not the context and progression of a conversation. The phrase, “you look cute in those shorts,” shared between two friends or sisters is perfectly acceptable, but it’s a whole other matter if it’s said by an adult to a teenager.

Classifier Based Solutions

Classifiers are message-level AI. They look at a single message or piece of content at a point in time and make a determination of its intent. For instance, classifiers are good at assessing explicit threats, such as a post that says, “I'm going to find you and kill you.” In general, however, classifiers look at a message without any additional context (e.g., who said it, to whom, and how it relates to other things that may have been said in previous exchanges).

While classifiers are good at detecting behavior at the message level, in general, they don’t consider the context. Certain posts and words can seem innocuous if taken out of context, while other words considered bad may be perfectly appropriate for adult users in other situations.

An additional challenge is that global toxicity classifiers are one-size-fits-all, which makes them too rigid or too broad for practical use. Companies like Google and Amazon have classifiers that address toxicity, but many social media and online community companies prefer not to use them. Each platform has its own definition of what is and is not allowed. A platform meant for children will have a very different definition than an online game that caters to young men or a dating app meant for adults.

Contextual AI Based Solutions

Contextual AI analyzes data within its context, meaning it looks at content (the complete raw text) and context (e.g., attributes of users and scenario, frequency of offender) to classify a behavior. 

Importantly, contextual AI can look across aspects of a platform -- posts, images, text, private chats, messaging -- and tie multiple of them together for an assessment. This approach to AI is critical for Trust & Safety teams because it is the only way to stop complex behaviors, such as radicalization, bullying, and grooming, that occur over multiple interactions. At its core, contextual AI looks at how behaviors build over time and how users respond to different messages to distinguish between consensual conversations and those that are not. 

Let's say you’re a gaming platform and 10 of your players are "talking trash." As a Trust & Safety professional, you want to ensure that it's playful banter and not bullying behavior. Context identifies the sentiment.

Once contextual AI identifies inappropriate behavior and the reason why it violates a community standard, it can enable the platform to automate an action against the offender (e.g., issue a warning, suspend the user for a brief period of time, or even ban the user altogether).


Active Learning

Language is far from static. Words change meaning over time or mean different things to different generations or communities. New phrases emerge as a result of news events, cultural phenomena, and a host of other activities. Knowing what those phrases mean and their connotation requires continuous human input to ensure the model is up-to-date with emerging behaviors.

AI models improve through active learning or "human in the loop" tuning cycles. Active learning consists of customer feedback, moderator actions (e.g., de-flagging a piece of text that was incorrectly flagged as profanity), and updating the language model with emerging slang and connotations. Active learning also requires a data science team to review model performance regularly. These inputs are fed back into the data vault and the system-training cycles.

What Goes Into Building Content Moderation AI In-House?

Building a content moderation AI tool in-house is a pretty monumental undertaking, one that requires several types of AI, as well as a robust team of data scientists to ensure training data is labeled correctly so that the algorithms can learn to predict which behaviors violate community standards, and which are healthy and positive.

Data Vault

All machine learning requires a massive dataset to train its algorithms to accurately predict answers to specific problems. The same is true for machine learning meant to augment human content moderation teams.

To train an algorithm in a Trust & Safety context, the data scientist must provide the machine with positive and negative examples of the behavior of interest. One of the biggest challenges to creating a contextual AI moderation tool is finding the right dataset.

What’s more, every piece of data must be labeled correctly so the machine will learn how to process it. You will also need to find the right people to label data. Labelers should be educated and highly fluent in the language used in the data they will evaluate. Diversity in labelers and a standardized vetting process are essential.

Spectrum Labs’ data vault is the world's largest AI training data set built specifically to capture harmful and positive behaviors. It is based on behaviors observed across all client platforms, enabling us to apply the insights learned from one platform to all others that use Spectrum Guardian Content Moderation AI. 


But first, a word on privacy: Spectrum Guardian was built for GDPR compliance and never stores or uses personally identifiable information (PII). We deploy a process known as pseudonymization to all content that stems from a client’s platform. Pseudonymization replaces every piece of personal data with an artificial identifier or pseudonym. In our case, user data is sent to a third party that performs the masking for us, which means that Spectrum never receives, stores, or processes any data that can contribute to the identification of a user.     

All of the data collected from client platforms are imported to Spectrum’s data vault. That anonymized data from the data vault is then fed into the tuning of the models. With every API call and corresponding behavior determination, the data vault is enriched, and the models are honed, enabling them to predict the user's intent more accurately. The behavior determination data flowing into the vault becomes the flywheel.

Data Research

Artificial Intelligence is only as good as the data that trains it. At Spectrum Labs, we have invested heavily in a rock-solid data operations workflow to ensure a broad and rich understanding of human behaviors on our customers’ platforms that host user-generated content.

Data Flows

We begin with a variety of public data sets and, depending on the behavior to capture, we leverage scrapers to get domain-specific public data. Spectrum Labs also works with well-recognized researchers in specific domains to get specialized data sets, such as Safe from Online Sex Abuse and the Center on Terrorism, Extremism, and Counterterrorism.

Finally, Spectrum Labs has a Department of Research that conducts in-depth research around specific behaviors that are used as data sources. Gathering and feeding these data sources into the system is an ongoing process.

Data Labeling and Lexicons

Spectrum Labs uses a cross-language model (XLM-RoBERTa) to train our algorithm to understand behaviors across multiple languages.

The XLM-RoBERTa (XLM-R) model is a type of language model used for natural language processing (NLP) tasks, such as machine translation, language modeling, and text classification. It is specifically designed to be multilingual, meaning it can handle multiple languages and transfer knowledge across them. It is trained on a large data set of text from multiple languages, enabling it to learn common patterns and relationships across languages.

A key feature of XLM-RoBERTa is its use of self-attention mechanisms, which allows it to determine and focus on the most important parts of the input text when making predictions. Self-attention mechanisms help the models understand the context and meaning of ambiguous language.

To train our XLM-RoBERTa model, we begin by defining a lexicon. Defining the lexicon is a crucial step: it specifies the exact definitions of the different behaviors we want the AI to surface and capture in both the text and the audio generated on a platform. Examples of lexicons include Hate Speech and Grooming on the negative side and Teaching & Inviting on the positive side.

The lexicon is the specification that is used for labeling. Large samples are taken from the data vault, then split into three different data sets:

  1. Training: Used for training the different behavior models.
  2. Testing: Used for testing the performance of the models.
  3. Evaluation: Used to evaluate the precision, recall, and accuracy of runtime determinations.
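The three-way split above can be sketched as follows. The function name, the 80/10/10 proportions, and the fixed seed are assumptions for illustration; the source does not state the actual proportions used.

```python
import random

def split_dataset(samples, train_frac=0.8, test_frac=0.1, seed=0):
    """Shuffle a copy of `samples` and cut it into three disjoint sets.

    Whatever remains after the training and testing cuts becomes the
    evaluation set. The fractions are illustrative defaults.
    """
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_train = int(len(shuffled) * train_frac)
    n_test = int(len(shuffled) * test_frac)
    training = shuffled[:n_train]
    testing = shuffled[n_train:n_train + n_test]
    evaluation = shuffled[n_train + n_test:]
    return training, testing, evaluation
```

Keeping the three sets disjoint matters: a model scored on examples it was trained on will look better than it really is.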


Through an extensive network of vetted native language experts, the sample data set is labeled according to the lexicon specification. From there, the labeled training data set is used to train the transformer models, while the labeled testing data set is used for quality assurance cycles.

Natural Language Processing

We already discussed machine learning. Another type of AI that is required for moderation is natural language processing or NLP. NLP as a field has received a lot of attention of late due to the rollout of generative AI tools such as ChatGPT and Bard. Both tools use natural language processing to help them create text that feels as if it was created by a human.

What is NLP exactly? NLP is an area of computer science that focuses on the interaction between computers and human (natural) languages. Data scientists seek to develop models that are capable of processing and analyzing vast amounts of natural language data. NLP is particularly useful in monitoring live-streamed events, as it can pick up on non-verbal signals and context.

It’s difficult to overstate the importance of context. A 14-year-old boy may tell his best friend, “I tripped and fell right in front of Julia in math today. I should just kill myself.” In this situation, the writer probably won’t commit suicide. However, it’s another matter altogether if a recently divorced person is concerned about financial issues and says, “I want to kill myself.”

A challenge for content moderators is that they must review large volumes of information quickly. NLP can serve as a first-line assessment tool, distinguishing real threats from hyperbole so that the more likely threats are sent to a human moderator close to real-time for faster review.

There are numerous other advantages of NLP. For instance, the more it's used, the better its models will get at successfully identifying the content that moderators need to see. It also allows for greater scale, enabling more conversations to be analyzed for troubling content. And as the cost of supercomputing and AI development goes down, more communities will be able to deploy it to keep their users safe.

Learn more: Natural Language Processing AI for Communities

Data Training and AI Content Moderation

Deciding whether content is toxic, prosocial, or neither requires definitions that are broad enough to encompass the target situations that we want to surface but narrow enough to limit the bias and subjectivity of the data labelers. Data training is by nature an iterative process, as both language and platforms evolve.

When Spectrum Labs receives disagreements in our labels, it is a valuable signal for us that the lexicon either needs more examples to better illustrate the behavior we want to target or that the labelers need more training in order to eliminate bias or subjectivity.
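One simple way to surface such disagreements is to measure, per item, what share of labelers chose the majority label. The sketch below is an assumption about how this could be done, not a description of Spectrum Labs' process; the function names and the 0.75 threshold are hypothetical.

```python
from collections import Counter

def agreement(labels):
    """Fraction of labelers who chose the most common label for one item."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)

def items_needing_review(labeled_items, threshold=0.75):
    """Return ids of items whose labeler agreement falls below `threshold`."""
    return [
        item_id
        for item_id, labels in labeled_items.items()
        if agreement(labels) < threshold
    ]
```

Items returned by `items_needing_review` are candidates for either a clearer lexicon example or additional labeler training.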

Let’s take the example of identifying threats made against a user by another user on a platform. Obviously, we begin with a definition of a threat. If we start with a very generic definition, it can easily include the common trash-talk that occurs on a gaming platform, which is then labeled as a threat under that definition. The result is numerous false positives.

To separate the real threats from trash-talking, we need to look at other factors that speak to legitimacy, feasibility, and other details that will provide data labelers with a better measuring stick for assessing if it’s a real threat or not. In turn, those labelers will create more accurate labels that can be used to take action downstream.


Behavior Determination with AI

Once the lexicon is defined and the data is labeled accurately, Spectrum Labs behavior models are deployed in Spectrum Labs’ production environment to determine behaviors at incredible speed and scale.

The behavior determination process follows these steps:

  1. A platform receives a string of user-generated content (UGC) and sends it to the Spectrum API for AI content analysis.
  2. Spectrum performs pre-processing on the UGC.
  3. Spectrum reviews the user's metadata.
  4. Spectrum reviews the user's history.
  5. Spectrum checks the custom detection list, which contains keywords that each platform may specify.
  6. Spectrum runs the string through all the deployed behavior models.
  7. Spectrum runs the string through any custom models that a platform has provided through the Spectrum Labs Bring Your Own Model framework.
  8. Spectrum arrives at a boolean (i.e., “true” or “false”) determination for each behavior.
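The flow above can be sketched as a small pipeline. Everything here is a hypothetical stand-in: the function names, the context dictionary, and the toy model are assumptions for illustration and do not reflect the actual Spectrum API or its internals.

```python
from typing import Callable, Dict

# Hypothetical model interface: any callable that takes a context dict
# and returns True or False.
BehaviorModel = Callable[[dict], bool]

def preprocess(text: str) -> str:
    """Stand-in pre-processing: lowercase and collapse whitespace."""
    return " ".join(text.lower().split())

def determine_behaviors(text, metadata, history, custom_terms,
                        models: Dict[str, BehaviorModel]) -> Dict[str, bool]:
    """Return one boolean per behavior, plus a custom-list hit flag."""
    clean = preprocess(text)
    # Metadata and history travel with the text so models can use context.
    context = {"text": clean, "metadata": metadata, "history": history}
    results = {"custom_list": any(t in clean.split() for t in custom_terms)}
    for name, model in models.items():
        results[name] = bool(model(context))
    return results

def toy_threat_model(context: dict) -> bool:
    """Toy stand-in for a trained model: matches one explicit phrase."""
    return "kill you" in context["text"]
```

Calling `determine_behaviors("I will KILL   you", {"age": 30}, [], ["scam"], {"threat": toy_threat_model})` yields a `True` determination for the toy threat behavior and `False` for the custom list.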

The entire process is completed in under 20 milliseconds, and our API currently processes billions of pieces of user-generated content (UGC) every day. The evaluation-labeled data set is then used to automatically perform accuracy, precision, and recall analysis.

Additionally, all UGC data that the behavior determination cycle processes are fed into our data vault to be anonymized. This is accomplished via a transformer neural network, which is a deep learning model designed to process sequential data by jointly encoding and decoding input sequences using self-attention (i.e., selecting the most important inputs) mechanisms.



Large Language Models vs. Advanced Behavior Systems

Large language models (LLMs) have become very popular recently. However, behavior is determined by more than just the XLM-R model and data.

Where Spectrum Labs’ technology and LLMs differ is that LLMs use the open internet to learn and lack domain-specific active learning cycles. That means the LLMs may learn things that are false (e.g., from Reddit), but the humans in charge of providing feedback into the learning models may not know that they are false. This was recently seen at the launch event of Google Bard, where the LLM mistakenly credited the James Webb Space Telescope with taking the first photo of a planet outside our solar system.

In Spectrum Labs’ case, the data that models learn from is carefully sourced and curated to train our models on one specific type of behavior. Additionally, active learning with human feedback is achieved by employing specialists in the areas of Trust & Safety, and language. As a result, Spectrum Labs’ models are directed to a narrow domain, and reinforcement learning is fueled by experts in that domain.

Measuring Content Moderation Tool Performance

Whether you build a content moderation tool in-house or purchase content moderation AI from a provider, you will want to know that it can do what you need it to do, such as identifying content and posts that violate your community guidelines and surfacing content you want your team of moderators to address.

The key is to ask how the platform moderation tool strives for and measures performance. Performance can be measured using a variety of metrics such as accuracy, precision, recall, false positives, and false negatives. These metrics can help identify improvement areas and optimize content moderation tools' performance.



Accuracy measures how well a model is able to make correct predictions on new, unseen data. It helps us evaluate how well a model performs and whether it is suitable for a particular task.

To calculate accuracy, we add up the number of times the classifier correctly identified a positive instance of a behavior (True Positive) and the number of times it correctly identified a negative instance (True Negative) and then divide that by the total number of instances in the dataset.
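The accuracy calculation, along with the standard definitions of precision and recall, can be written directly from the four confusion-matrix counts. The function names and the example counts are illustrative assumptions.

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """(True Positives + True Negatives) divided by all instances."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0

def precision(tp: int, fp: int) -> float:
    """Share of flagged items that were truly violations."""
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp: int, fn: int) -> float:
    """Share of true violations that the model flagged."""
    return tp / (tp + fn) if (tp + fn) else 0.0
```

For a hypothetical run with 80 true positives, 900 true negatives, 10 false positives, and 10 false negatives, accuracy is 980 / 1000 = 0.98, while precision and recall are both 80 / 90. When violations are rare, accuracy alone can look high even if many violations are missed, which is why precision and recall are reported alongside it.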

Operational Precision

Operational precision refers to the ability of the content moderation tool to accurately identify and remove harmful content without mistakenly flagging benign content. A high level of operational precision means the tool is effectively identifying harmful content while minimizing the impact on non-violating content.



Challenges with AI in Content Moderation

There are numerous challenges with AI content moderation, all of which must be overcome in order to detect and surface specific behaviors. Let’s look at them.

Leet Speak and AI Content Analysis

l337 (leet) is an online language originally used by hackers to prevent their sites and groups from being discovered by keyword searches and text filters. The term comes from the word "elite" (as in, those who knew leet speak were an elite group of hackers and, eventually, skilled video gamers).

Before the web existed the way we experience it today, some people participated in bulletin board systems (BBS), which did not always host discussions about activities that were legal. Leet speak gave users a way to talk about whatever they needed to talk about without being discovered. 

Leet speak creates challenges for content moderators who use keyword-based tools for user-generated content moderation to identify banned material or behaviors.
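One common mitigation for keyword-based tools is to normalize leet substitutions before matching. The sketch below uses Python's built-in `str.maketrans` and `str.translate`; the substitution table is a small, hypothetical sample and not a complete mapping.

```python
# Hypothetical sample of common leet substitutions (not exhaustive).
LEET_MAP = str.maketrans({
    "0": "o",
    "1": "l",
    "3": "e",
    "4": "a",
    "5": "s",
    "7": "t",
    "@": "a",
    "$": "s",
})

def normalize_leet(text: str) -> str:
    """Lowercase the text and replace leet characters with letters."""
    return text.lower().translate(LEET_MAP)
```

A one-to-one table like this is lossy: "1" can stand for "l" or "i", and ordinary numbers get rewritten too, which is one reason simple normalization only goes so far.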

Accuracy in AI-Based Content Moderation

Spectrum Labs offers a set of fields in our API response to provide platforms with additional signals to use when making decisions about whether to take action against a user and, if so, what action should be taken.

These fields are correlated to different confidence buckets, which reflect how confident the model is that the behavior is present in the data. At present, we offer three levels of buckets:

[Figure: the three confidence bucket levels]

For example, here is an idea for how you might want to incorporate confidence for Hate Speech results:

[Figure: example of incorporating confidence for Hate Speech results]
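As a rough sketch of how a platform might consume confidence buckets, the code below maps a numeric score to one of three buckets and each bucket to a handling choice. The bucket names, the 0.6 and 0.9 thresholds, and the handling labels are all assumptions made for illustration; they are not the values Spectrum Labs uses.

```python
def confidence_bucket(score: float) -> str:
    """Map a score in [0, 1] to a bucket (thresholds are hypothetical)."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0 and 1")
    if score >= 0.9:
        return "high"
    if score >= 0.6:
        return "medium"
    return "low"

# Hypothetical handling per bucket for a single behavior.
SUGGESTED_HANDLING = {
    "high": "automate_action",
    "medium": "send_to_moderator_queue",
    "low": "log_only",
}

def handling_for(score: float) -> str:
    """Look up the handling choice for a given confidence score."""
    return SUGGESTED_HANDLING[confidence_bucket(score)]
```

The design idea is that automation is reserved for the determinations the model is most sure about, while borderline cases go to a human.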

For more information, see Spectrum Labs discuss driving a high detection accuracy rate for toxic and spam users.


Language Detection for Global Platforms

Content moderation is extremely complex for platforms that extend their reach to new communities or regions and attract users who speak diverse languages. Adding new languages isn't as simple as translation. A term that's harmless in one language can be toxic in another. Translation apps are inaccurate and slow and miss implied meanings.

Problems with hate speech and violent extremism can proliferate if the platform lacks local moderators and expertise to monitor for emerging terms. And yet most AI-based content moderation solutions face obstacles to adding new languages, as they require hiring local speakers to construct keyword lists and building up language-specific data to train models over time.

It can be a long, expensive process to add each new language, leaving users unprotected.  To address this challenge, Spectrum has developed a patented, real-time, automated AI solution that helps platforms scale globally faster, at a lower cost, and with better results.

The solution supports:

  • A wide range of languages, including character-based, hybrid, and l33tspeak 
  • Localized, customized, automated actions
  • Insights for regions, languages, and user behaviors 

Spectrum Labs uses AI language detection that transfers learning from one language to another, allowing you to add new languages immediately, then refine over time. Our contextual AI directly analyzes content in its native language and compares patterns to those found in languages where the most data exists to make the right determinations across languages. 

This multi-language approach removes blind spots due to not having moderators everywhere, enabling platforms to benefit from Spectrum Labs' investment in native-speaker experts and data and our patented AI multi-language approach.

For a detailed explanation of how our AI solution works, please refer to our white paper.

Changing Behaviors

Models need to be updated on a continuous basis in order to accommodate new terms that enter the lexicon or to better reflect a platform's evolving community guidelines. This is why Spectrum Labs’ workflows and processes around advanced behavior systems are so important.

To enable platforms to keep up with emerging language, Spectrum Labs offers a feature known as a custom detection list, which enables individual platforms to include additional terms not currently flagged by one of our models.

This list can be useful for a variety of scenarios, including:

  • Anything that may be missed by the current solutions but is important for a customer to be able to flag
  • A Zero Tolerance list for terms that may or may not be covered by other solutions' definitions
  • Terms that are outside of the scope of the licensed behavior's definitions
  • Known bad terms in languages outside of what was licensed.

Scaling AI for Content Moderation

As a platform grows its user base, it will quickly overwhelm its team of human moderators. Fortunately, supercomputers can crunch through data at enormous velocity, enabling them to recognize patterns and warning signals well before a human can. 

A content moderation algorithm can also be deployed on a number of channels at once, providing consistent analysis and feedback to all. AI models and computers don’t suffer from mental fatigue; their eyes don’t glaze over.

AI-based content moderation solutions are also better at pinpointing patterns that a human might miss. Early pattern recognition is essential to protecting community members. Certain patterns, such as an adult male asking a pre-teen girl what she wore to school that day, can be detected quickly by AI, whereas a human might not identify that relationship as grooming until much later.

While human content moderators can interpret the nuance and context of an interaction, they do not approach these as consistently as an AI-based solution. No matter how skilled or well-trained your moderators are, or how clearly your community guidelines have been communicated, moderators still have an overwhelming workload and expectations of productivity that create extreme cognitive stressors. These provide exactly the conditions in which a person’s unconscious bias can float to the surface and affect their instinctive responses.

Until recently, it has been difficult to train AI algorithms to address the more complex nuance, context, and variables of human interactions. But as AI models become more refined and more training data is available, these tools are becoming better and better at ‘reading the room’ - providing more accurate identification of the nuance and context of complex online interactions.

Finally, AI helps scale detection of specific behaviors in cases where a platform has a large user base. The Trust & Safety teams can configure different types of AI-driven actioning. In these scenarios, the AI will automate the actions taken (see below for more detail). 

Click here for more information about managing machine learning life cycles at scale.


Types of Content Moderation Actioning

Automated Actioning on Behaviors

Automated actioning helps content moderators do their jobs as efficiently as possible. Without a technology solution, keeping their platforms safe would require the content moderation team to examine each and every post to ensure users abide by the community guidelines. Obviously, this is an impossible scenario. 

Role of Content Moderators in Actioning

Every day, content moderators are tasked with reviewing posts and making judgment calls -- e.g., is this hate speech, and if so, what action should I take on this user? Is this an anomaly for the user or part of a pattern?

Once they make a determination that a post violates community standards, they must take some action on it. Both the determinations and actions taken will vary from platform to platform.

In some cases, community guidelines may require a moderator to issue a warning; for instance, is this the first time a user said something offensive? Perhaps the user is simply having a bad day and needs to be told, “we don’t consider that an acceptable way to address members of our community.”

In other instances, behavior that is acceptable on other popular platforms may not be appropriate on yours. In such instances, the action called for in the community guidelines is to send the user a link to the community standards.


Conversely, if a user has been repeatedly told about an infraction, the moderator may impose a temporary or permanent ban, depending on the rules of the platform. 

While the rules of conduct vary from platform to platform, what remains the same is the need to apply them uniformly to a user base.

Augmenting the Human Moderating Team

To see how technology can augment the work of the content moderation team, let’s look at Spectrum Labs' Guardian solution as an example.

Guardian has 20 behavior models, ranging from hate speech and bullying, to extremist recruitment and grooming. All content and text that is posted to a site can be reviewed against any combination of these models and actioned upon in accordance with community guidelines.

How it Works

If a community uses Guardian, any time content is posted, it is treated as input and is sent to the Spectrum Labs API for processing against a behavior model. An analysis is sent back that includes a determination that forbidden content was or was not detected, along with a confidence level. All determinations are grouped into confidence buckets, which, as we’ll see in a little bit, help content moderation teams make important decisions. This basic analysis and determination process takes less than 20 milliseconds.


Automated Decisioning & Response on Message-Level Content

In some cases, clients want Guardian to augment their content moderator team by automating some of the actions that humans typically perform. For instance, a platform’s anti-hate speech policy may be to send offenders its community guidelines with a warning. 

Guardian can automate this action because our solution can support a set of rules that says if a user’s input (e.g., content) contains hate speech in English with a high confidence level, the output is to show a content-action tag that contains the platform’s community guidelines. 

The determination → response → content-action tag sequence provides a mechanism that platforms can use to automate actions on message-level content.
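A rule of the kind described above (hate speech, in English, with high confidence, mapped to a content-action tag) might be represented as follows. The `Rule` fields, the determination dictionary keys, and the tag string are hypothetical and are not the actual Guardian response schema.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

# Ordering used to compare confidence levels (names are assumptions).
CONFIDENCE_ORDER = {"low": 0, "medium": 1, "high": 2}

@dataclass(frozen=True)
class Rule:
    behavior: str
    language: str
    min_confidence: str
    action_tag: str

def content_action(determination: dict, rules: Iterable[Rule]) -> Optional[str]:
    """Return the tag of the first rule the determination satisfies."""
    for rule in rules:
        confident_enough = (
            CONFIDENCE_ORDER[determination["confidence"]]
            >= CONFIDENCE_ORDER[rule.min_confidence]
        )
        if (
            determination["detected"]
            and determination["behavior"] == rule.behavior
            and determination["language"] == rule.language
            and confident_enough
        ):
            return rule.action_tag
    return None  # no rule matched, so no automated action

RULES = [Rule("hate_speech", "en", "high", "show_community_guidelines")]
```

The platform, not the detection service, decides what to do when it sees the returned tag.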

Note: Spectrum Labs does not decide the rules or community guidelines, nor can we show any content within our clients’ platforms. We send a response back to the API with a determination, confidence level, and action tag.

Content Actions vs. Event-Based Actions

Content actions occur in real time because they include a determination and response. APIs are always input/output. I send something in, and I get a response back.

There is another type of actioning that is based on webhooks and not API responses. Webhooks are a way for web applications to provide real-time information to other applications or services. Think of them as push notifications that are triggered by an event in one application and delivered to another application via HTTP POST requests.

In content moderation terms, a webhook is fired when the rules a platform has set determine that an action is necessary. Let’s say a platform’s policy is to warn offenders the first two times they use hate speech and to ban them the third time. For the first two violations of the hate speech policy, the user receives a warning and nothing more. On the third violation, however, the platform executes the ban. We call this event-based actioning, meaning we wait for a rule to become true before executing an action.
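The warn-twice, ban-on-the-third policy can be sketched as a per-user counter. The class name and the action strings are assumptions for illustration. A production system would persist the counts rather than hold them in memory.

```python
from collections import defaultdict


class StrikeRule:
    """Warn on early violations and ban once a user reaches the limit."""

    def __init__(self, ban_on_violation: int = 3):
        self.ban_on_violation = ban_on_violation
        self.counts = defaultdict(int)  # user_id -> number of violations so far

    def record(self, user_id: str) -> str:
        """Record one violation and return the action to take."""
        self.counts[user_id] += 1
        if self.counts[user_id] >= self.ban_on_violation:
            return "ban"
        return "warn"


rule = StrikeRule()
for _ in range(3):
    print(rule.record("user_b"))
```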

Some event-based actions may be triggered when specific conditions occur, such as when a user aged 25 or above sends a message to a user aged 12 or younger. In such cases, we use metadata to trigger an event-based action.

Platforms can set a variety of rules that when met indicate a specific condition, such as grooming or recruitment, is true. When conditions are true, an action is automatically triggered.

Triggering a rule means a webhook is fired, using a JSON format that is very similar to the API described above with one exception: the webhook isn’t a reaction to a content input, it is based on the rules the platform has set.
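To make the metadata example concrete, here is a minimal sketch that evaluates the age rule and, when it is met, builds a JSON body that a webhook could carry. The rule name and every JSON field are hypothetical. The actual webhook format is defined by the vendor.

```python
import json
from typing import Optional


def adult_to_child_rule(sender_age: int, recipient_age: int) -> bool:
    """True when a user aged 25 or above messages a user aged 12 or younger."""
    return sender_age >= 25 and recipient_age <= 12


def build_webhook_payload(sender: dict, recipient: dict) -> Optional[str]:
    """Return a JSON payload if the rule fires, otherwise None."""
    if not adult_to_child_rule(sender["age"], recipient["age"]):
        return None
    return json.dumps({
        "rule": "adult_messaging_child",  # hypothetical rule name
        "sender_id": sender["id"],
        "recipient_id": recipient["id"],
    })


print(build_webhook_payload({"id": "a1", "age": 30}, {"id": "c7", "age": 11}))
```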


Building Cases

Let’s say a user posts content that the model determines is hate speech with a high level of confidence. In addition to sending a warning, the platform can also add it as a case to the human moderator’s queue. This will allow a moderator to reach out to the offender directly if desired.

The approach combines automation (e.g. the content tag with a warning) with a human touch to drive home the point that such behaviors are not tolerated on the platform.

User Reports

User reports allow users to report other users whom they feel violated community standards. Let’s say user A files a report saying that user B is bullying them. Those exchanges will have been analyzed by Guardian, which made a determination, but the confidence level may have been low. In such cases, the conditions may not have been met to say this is an instance of true bullying. That low confidence level may be due to some ambiguity, but a user report can clear up that ambiguity, enabling a content moderator to step in with an appropriate action.

This process can also be automated by implementing rules that say if we have instances of a violation with low confidence and we receive a user report with the same claim, the conditions are met to take an action against the violator. 
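A rule of this kind might look like the sketch below. The bucket names and the decision to act immediately on high-confidence determinations are assumptions. The only logic taken from the text is that a low-confidence determination plus a user report making the same claim is enough to act.

```python
from typing import Optional


def should_take_action(detected_behavior: str,
                       bucket: str,
                       reported_behavior: Optional[str] = None) -> bool:
    """Decide whether the conditions for an automated action are met."""
    if bucket == "high":
        # Assumption: high-confidence determinations need no corroboration.
        return True
    # A low-confidence determination is corroborated by a matching user report.
    return bucket == "low" and reported_behavior == detected_behavior


print(should_take_action("bullying", "low"))
print(should_take_action("bullying", "low", reported_behavior="bullying"))
```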

User-Level Moderation

Content that isn't automatically removed through the real-time API or automation is sent to the moderation queue. In the queue, moderators can review each piece of toxic content at the message level or the user level.

User-level moderation allows moderators to identify and focus on the worst cases quickly and easily, so they aren’t left unaddressed. Spectrum Labs’ approach combines individual user reputation scores, user-level moderation, and behavior detection trends. Let's see how each contributes to the solution.

User-level moderation lets Trust & Safety teams view cases that are prioritized by severity and grouped by user. This feature allows them to see at a glance the number of severe cases mapped to a single user and focus on that user first. Moderators can also see all user-level information and manage multiple cases for a single user simultaneously. This enables them to make better decisions with complete, not partial, information.
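Grouping cases by user and ranking users by their number of severe cases can be sketched as follows. The case fields (user_id, severity) and the severity threshold are hypothetical and do not describe how Guardian stores or scores cases.

```python
from collections import defaultdict

SEVERE_THRESHOLD = 3  # hypothetical: severity at or above this counts as severe


def prioritize_users(cases: list) -> list:
    """Return user IDs ordered by severe-case count, then by total case count."""
    by_user = defaultdict(list)
    for case in cases:
        by_user[case["user_id"]].append(case)

    def sort_key(item):
        user_id, user_cases = item
        severe = sum(1 for c in user_cases if c["severity"] >= SEVERE_THRESHOLD)
        # Negative counts sort the worst offenders first; user_id breaks ties.
        return (-severe, -len(user_cases), user_id)

    return [user_id for user_id, _ in sorted(by_user.items(), key=sort_key)]


queue = [
    {"user_id": "u1", "severity": 1},
    {"user_id": "u2", "severity": 3},
    {"user_id": "u2", "severity": 4},
    {"user_id": "u3", "severity": 3},
]
print(prioritize_users(queue))
```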

Moderation at the user level keeps the community safer by enabling faster, automated, and escalating actions against repeat offenders. Moderators are presented with their community's recommendations for user-level action (which they can override if needed). This drives efficiency by allowing moderators to handle more cases at once and prevents future cases from toxic users.

Measuring Community Health and ROI on Content Moderation

Spectrum Labs AI detection solutions can help Trust & Safety teams detect individual behaviors, such as hate speech, radicalization, and threats, but are those one-off issues? Or are these cases indicative of a larger problem on your platform? 

To help clients answer that question, we combine metrics about top toxic and at-risk users with other factors, such as behaviors, languages, community areas, and user attributes. This allows you to see at a glance the frequency of infractions and detect behavioral trends that may occur on your platform.

We deploy a variety of tactics to identify behavior trends. The first is an AI-driven behavioral detection feature. When a user creates text content, Spectrum's API validates it against behavior solutions. The API returns results for the content, including which behaviors were flagged and confidence levels. 

Our dashboards allow you to filter by behaviors, time ranges, content categories, and more. You can also choose how to visualize the data and export it for distribution or further analysis, such as overlaying metrics with your other key performance indicators to identify influencers. These analytics and insights help you make better decisions about how to shape your community.

Spectrum Labs can help you:

  • Identify which problems or chatrooms to address first
  • Learn which users drive toxicity
  • Understand how behaviors interrelate
  • Inform policies for different languages
  • See emerging patterns to address early

These insights allow you to identify the types of problems that happen on your platform so that you can provide any additional training for your moderators, as well as update your community policies as you need to.

All three product features – user reputation score, user-level moderation, and behavior detection trends – are fully privacy compliant.

How to Integrate a Content Moderation Vendor

There are multiple ways to integrate a platform with a content moderation vendor. Spectrum Labs can integrate directly into a platform’s technology infrastructure via either a real-time or an asynchronous scenario.


In a real-time scenario, a platform will use an API key and an account identifier to make a JSON request to the Spectrum API. A JSON request is an HTTP request that uses the JavaScript Object Notation (JSON) format to send data between a client and a server. The data is included in the request body as a JSON object, and the server processes the data and returns a JSON response. Spectrum’s API will respond with a JSON payload that indicates the specific behaviors detected, along with the message content and metadata. 

The API can be used for:

  • Single messages
  • Batches of up to 100 messages
  • User reports

At that point, your system may decide to redact the content without further actions, or it may consume the message and await a webhook invocation.
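The request side of the real-time scenario can be sketched as below. This only builds the headers and JSON body and does not call any endpoint. The header name, the Bearer scheme, and the body fields are assumptions for illustration. Consult the vendor’s API documentation for the real schema. The only detail taken from the text is the limit of 100 messages per batch.

```python
import json

MAX_BATCH_SIZE = 100  # batch limit stated in the list above


def build_request(api_key: str, account_id: str, messages: list) -> tuple:
    """Build (headers, body) for a hypothetical moderation request."""
    if not messages:
        raise ValueError("at least one message is required")
    if len(messages) > MAX_BATCH_SIZE:
        raise ValueError(f"a batch may contain at most {MAX_BATCH_SIZE} messages")
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {api_key}",  # hypothetical auth scheme
    }
    body = json.dumps({
        "account_id": account_id,  # hypothetical field name
        "messages": [{"id": str(i), "text": text} for i, text in enumerate(messages)],
    })
    return headers, body


headers, body = build_request("example-key", "example-account", ["hello", "world"])
print(body)
```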


The second scenario, asynchronous, offers more flexibility in that you can define when the webhook should be fired. A webhook is a mechanism that allows one application to send automated messages or data to another application through a simple HTTP request.

In a content moderation scenario, a webhook is a notification sent from the Spectrum Labs API to your API based on predefined scenarios. When a webhook is fired, your web API decides whether to take automated action or send the content to your content moderation team for review.
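On the receiving side, the decision your API makes might look like this minimal sketch. It is a plain function that would be called from whatever HTTP framework handles the POST request. The rule name and the "rule" field in the payload are hypothetical.

```python
import json

# Hypothetical: rules the platform has chosen to action without human review.
AUTO_ACTION_RULES = {"third_hate_speech_violation"}


def handle_webhook(raw_body: bytes) -> str:
    """Route an incoming webhook to automated action or moderator review."""
    event = json.loads(raw_body)
    if event.get("rule") in AUTO_ACTION_RULES:
        return "automated_action"
    return "moderator_review"


print(handle_webhook(b'{"rule": "third_hate_speech_violation", "user_id": "u1"}'))
print(handle_webhook(b'{"rule": "possible_grooming", "user_id": "u2"}'))
```

Defaulting to moderator review for any rule not explicitly listed keeps a human in the loop for cases the platform has not decided to automate.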

Learn More: API & Implementation

About Spectrum Labs Content Moderation AI

Spectrum Labs works with online communities to help them stop disruptive behavior. We do that by offering AI content moderation that's customized to your community guidelines.

Our behavior identification solution is made up of a number of different methods:

  • Lookups. Think of them as keywords. While a rudimentary solution on its own, lookups can be an incredibly powerful tool when combined with other methods.
  • Classifiers. Classifiers identify behaviors such as hate speech, insults, and self-harm within a single piece of data.
  • Historical Context. Aspect modeling is a statistical technique used in NLP to identify the different aspects or topics discussed in a text. It allows us to extract and cluster words that are related to specific aspects, which can be useful for sentiment analysis. Our aspect models are also trained via our data vault. These models look at the full context of a conversation, a user stream, or a thread, and they're able to identify complex behaviors.
  • User Reputation Scores. This important score allows human moderators to see at-a-glance where to focus their attention and take action.
  • Metadata. Metadata is used as an input to user reputation scores, aspect models, and classifiers. It's also an important signal on its own.


Spectrum Labs clients are free to combine these different results into an accurate and powerful signal they can trust to keep their community safe.

Contact Spectrum Labs Today

Whether you are looking to safeguard your audiences, increase brand loyalty and user engagement, or maximize moderator productivity, Spectrum Labs empowers you to recognize and respond to toxicity in real-time across languages. Contact Spectrum Labs to learn more about how we can help make your community a safer place.