MURMURATE’S
DATA LABELLING APPROACH
Understand how our approach to labelling data differs from others.
Case Study: COVID-19 Misinformation in Tweets
The same task is shown twice: first with hand-labelling, then with Murmurate’s data labelling.
HAND-LABELLING DATA
A large training dataset (circa 1,000,000 tweets) is collected which includes examples of what the final AI model will need to recognise.
Next, criteria are established for labelling tweets as misinformation. For example, at the most basic level this might be: ‘Does this tweet include false information? If yes, label as misinformation’.
People familiar with the topic (subject matter experts) go through each tweet and apply these criteria until at least 50,000 instances of misinformation have been identified and labelled.
Presuming only 5% of tweets are misinformation, finding 50,000 instances means reading all 1,000,000 tweets. If it takes 5 seconds to read each tweet and decide whether the criteria apply, this process would take about 1,388 hours, or 173 working days (roughly 7.5 months), for one person.
Even with 10 people labelling, this process would take 17.3 working days, almost a month.
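The arithmetic behind these figures can be checked with a short sketch. It assumes an 8-hour working day, which the figures imply but the case study does not state; the function and constant names are illustrative only.

```python
# Back-of-envelope estimate of hand-labelling effort for the case study.
# Assumption (not stated in the case study): an 8-hour working day.

TARGET_MISINFO_TWEETS = 50_000   # labelled instances of misinformation needed
MISINFO_PERCENT = 5              # presumed share of tweets that are misinformation
SECONDS_PER_TWEET = 5            # time to read one tweet and apply the criteria
HOURS_PER_WORKING_DAY = 8        # assumption


def hand_labelling_effort(labellers: int = 1) -> dict:
    """Return tweets to read, total person-hours, and working days per labeller."""
    # At 5% prevalence, 50,000 positives require reading 1,000,000 tweets.
    tweets_to_read = TARGET_MISINFO_TWEETS * 100 // MISINFO_PERCENT
    person_hours = tweets_to_read * SECONDS_PER_TWEET / 3600
    working_days = person_hours / HOURS_PER_WORKING_DAY / labellers
    return {
        "tweets_to_read": tweets_to_read,
        "person_hours": person_hours,
        "working_days": working_days,
    }


if __name__ == "__main__":
    for team_size in (1, 10):
        effort = hand_labelling_effort(team_size)
        print(
            f"{team_size:>2} labeller(s): {effort['person_hours']:.0f} person-hours, "
            f"{effort['working_days']:.1f} working days each"
        )
```

Run as a script, this gives roughly 1,389 person-hours and 173.6 working days for one labeller, and 17.4 working days each for a team of ten. The case study quotes the same values rounded down.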
The bulk of the labelled data is used for training the AI model so that it can make predictions of misinformation on its own.
A smaller set from the hand-labelled data (known as a ground truth dataset) is used to validate the trained AI model’s predictions.
If the AI model needs updating as the topic or language of COVID misinformation changes, the data must be re-labelled by hand all over again. Re-labelling time: about 1,388 hours.
MURMURATE’S DATA LABELLING
A large training dataset (circa 1,000,000 tweets) is collected which includes examples of what the final AI model will need to recognise.
Design a set of rules for labelling which tweets are misinformation, called labelling functions. Murmurate automatically labels all 1,000,000 tweets according to these rules.
LABELLING FUNCTIONS: Label as ‘misinformation’ if the text contains a given combination of keywords. The two example functions (a and b) combine the keywords vaccines, gates, bioweapon, hoax, microchip, control, lethal, who.org and philanthropy using AND, OR and NOT.
Each rule taken on its own does not accurately detect misinformation. A labelling model combines these labelling functions, comparing them against each other to assign a probabilistic weight to each. The resulting weighted labels are used in training the AI model. In this way the model finds the best combination of labelling functions.
The process of Murmurate automatically labelling the 1,000,000 tweets with the generated labels takes as little as 3 to 4 hours.
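To make the idea of labelling functions and weighted labels more concrete, here is a minimal sketch. The keyword combinations are hypothetical (the case study lists keywords and operators, not the exact rules), the example tweets are invented, and the weighting scheme is a deliberately simple stand-in: each function is weighted by how often it agrees with the majority vote. It is not Murmurate's labelling model.

```python
# Illustrative sketch only: hypothetical keyword rules combined by a simple
# agreement-weighted vote. Not Murmurate's actual rules or labelling model.

ABSTAIN, NOT_MISINFO, MISINFO = -1, 0, 1


def lf_vaccine_microchip(text: str) -> int:
    """Hypothetical rule: 'vaccines' AND ('microchip' OR 'control') AND NOT 'philanthropy'."""
    t = text.lower()
    if "vaccines" in t and ("microchip" in t or "control" in t) and "philanthropy" not in t:
        return MISINFO
    return ABSTAIN


def lf_bioweapon_or_hoax(text: str) -> int:
    """Hypothetical rule: 'bioweapon' OR 'hoax'."""
    t = text.lower()
    return MISINFO if ("bioweapon" in t or "hoax" in t) else ABSTAIN


def lf_official_source(text: str) -> int:
    """Hypothetical rule: a mention of who.org suggests the tweet is not misinformation."""
    return NOT_MISINFO if "who.org" in text.lower() else ABSTAIN


LABELLING_FUNCTIONS = [lf_vaccine_microchip, lf_bioweapon_or_hoax, lf_official_source]

# Invented example tweets, for illustration only.
EXAMPLE_TWEETS = [
    "Vaccines carry a microchip, the pandemic is a hoax",
    "who.org is lying: vaccines are for population control and covid is a hoax",
    "Vaccination advice is available at who.org",
    "Nice weather today",
]


def apply_functions(tweets):
    """Return one row of votes per tweet, one column per labelling function."""
    return [[lf(tweet) for lf in LABELLING_FUNCTIONS] for tweet in tweets]


def majority_vote(votes):
    """Unweighted majority of the non-abstaining votes; ABSTAIN on a tie or no votes."""
    cast = [v for v in votes if v != ABSTAIN]
    positives = sum(cast)
    negatives = len(cast) - positives
    if positives == negatives:
        return ABSTAIN
    return MISINFO if positives > negatives else NOT_MISINFO


def estimate_weights(vote_matrix):
    """Weight each function by its rate of agreement with the majority vote."""
    n = len(LABELLING_FUNCTIONS)
    agreed, counted = [0] * n, [0] * n
    for votes in vote_matrix:
        consensus = majority_vote(votes)
        if consensus == ABSTAIN:
            continue
        for j, vote in enumerate(votes):
            if vote != ABSTAIN:
                counted[j] += 1
                agreed[j] += int(vote == consensus)
    return [agreed[j] / counted[j] if counted[j] else 0.0 for j in range(n)]


def weighted_label(votes, weights):
    """Score between 0 and 1 that a tweet is misinformation (0.5 means no evidence)."""
    pos = sum(w for v, w in zip(votes, weights) if v == MISINFO)
    neg = sum(w for v, w in zip(votes, weights) if v == NOT_MISINFO)
    return pos / (pos + neg) if pos + neg else 0.5
```

In this toy example the second tweet mentions who.org while spreading misinformation, so the "official source" rule is outvoted there and ends up with a lower weight than the other two. That is the sense in which no single rule is accurate on its own but the combination can still produce useful weighted labels.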
The bulk of the labelled data is used for training the AI model so that it can make predictions of misinformation on its own.
A smaller set of labelled data (known as a ground truth dataset) is used to validate the trained AI model’s predictions.
If the AI model needs updating as the topic or language of COVID misinformation changes, the data can be quickly re-labelled with adjusted rules. Re-labelling time: 3 to 4 hours.
TOTAL AI PROJECT TIME
Hand-labelling data: several months, or even years, across three iterations.
Murmurate’s data labelling: a few hours, or a day or two, across three iterations of 3 to 4 hours each.
Murmurate’s automated labelling process means that you can label your data in hours, not weeks or even months.
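The training and validation steps in the case study can be sketched as follows: most of the labelled tweets go to training, a smaller ground truth set is held back, and the trained model's predictions are compared against it. The 10% hold-out fraction, the function names and the choice of precision and recall as measures are assumptions made for illustration; the case study does not specify them.

```python
import random

# Illustrative sketch: hold back a ground truth set and score predictions against it.
# The 10% hold-out fraction is an assumption, not a figure from the case study.


def split_dataset(examples, ground_truth_fraction=0.1, seed=0):
    """Shuffle labelled examples and return (training_set, ground_truth_set)."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_ground_truth = max(1, int(len(shuffled) * ground_truth_fraction))
    return shuffled[n_ground_truth:], shuffled[:n_ground_truth]


def precision_recall(predicted, actual):
    """Compare model predictions with ground truth labels (1 = misinformation, 0 = not)."""
    pairs = list(zip(predicted, actual))
    true_pos = sum(1 for p, a in pairs if p == 1 and a == 1)
    false_pos = sum(1 for p, a in pairs if p == 1 and a == 0)
    false_neg = sum(1 for p, a in pairs if p == 0 and a == 1)
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall
```

Precision answers "of the tweets the model flagged, how many were really misinformation?" and recall answers "of the real misinformation, how much did the model find?". Both are computed only on the held-back ground truth set, which the model never saw during training.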
MURMURATE’S
DATA LABELLING
INCREASED SPEED
The main benefit over other labelling approaches is speed. Instead of taking weeks and months, datasets can now be labelled in a matter of hours.
ITERATE QUICKLY
Your AI model may need to be updated to keep up with changing needs or data drift. Because our data labelling approach is so quick, it’s easy to relabel data and retrain your model.
REDUCE COSTS
Achieve in hours and with just one person what would typically take a large team many months. Traditional ways of labelling data are slow and costly, putting AI out of reach of many businesses.
INCREASED ACCURACY
Because costs are lower, our faster labelling approach makes it practical to create much larger training datasets, which in turn can produce more accurate models.
MURMURATE
NOW ANYONE CAN BUILD AI.
Murmurate’s unique and innovative approach is transforming the process of building AI and bringing it within the reach of many more businesses than before, both small and large.
Book a demo for your team to learn how it works. Bring the power of AI to your business.
Contact Us
AI POWERED MEDICAL + HEALTHCARE
AI POWERED RETAIL + MARKETING
AI POWERED HUMANITARIAN SERVICES
AI POWERED SECURITY + DEFENSE
AI POWERED ENVIRONMENTAL SERVICES
AI POWERED FINANCIAL SERVICES
AI POWERED HOUSING + DEVELOPMENT