agreement metrics for time series annotation task

Summary

Top 8 papers analyzed

Evaluation metrics for time series annotation task

In time series annotation, several metrics quantify inter-annotator agreement, which indicates the reliability and consistency of annotations. The simplest is observed agreement, the proportion of instances or segments on which the annotators agree. However, this metric does not account for agreement by chance. To address this, Cohen's kappa is often used; it measures the agreement between two annotators corrected for the agreement expected by chance. Fleiss' kappa extends this to more than two annotators. Krippendorff's alpha also handles multiple annotators and supports different types of annotation. For these chance-corrected coefficients, 1 indicates perfect agreement, 0 indicates agreement no better than chance, and negative values indicate agreement worse than chance.

Customized metrics are also used for specific types of annotation task. For instance, in event annotation, metrics such as temporal vagueness and temporal consistency help quantify the variability in identifying event boundaries. In structured or semantic annotation, metrics based on hierarchical relationships can quantify agreement in identifying concepts and named entities. Some studies report multiple metrics to analyze inter-annotator agreement from different perspectives. Because time series annotation often involves identifying and labeling segments, segmentation metrics such as WindowDiff are also reported alongside annotation agreement metrics.

In summary, several statistical metrics quantify agreement for the time series annotation task, taking into account factors such as chance agreement, the number of annotators, and the annotation type. Reporting multiple metrics provides a more comprehensive analysis of annotation reliability.
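As a rough illustration of the first two metrics, the sketch below computes observed agreement and Cohen's kappa for two annotators who each assigned one label per time step. The function names, the label values, and the per-time-step representation are assumptions made for this example; they are not taken from any of the papers listed here.

```python
from collections import Counter


def observed_agreement(labels_a, labels_b):
    """Fraction of time steps on which the two annotators chose the same label."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    matches = sum(x == y for x, y in zip(labels_a, labels_b))
    return matches / len(labels_a)


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e) for two annotators."""
    n = len(labels_a)
    p_o = observed_agreement(labels_a, labels_b)
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability that both annotators pick the same label
    # if each labels independently according to their own label frequencies.
    p_e = sum(counts_a[label] * counts_b[label]
              for label in counts_a.keys() | counts_b.keys()) / (n * n)
    if p_e == 1.0:
        # Both annotators used a single identical label throughout; kappa is
        # undefined here, and returning 1.0 is a convention chosen for this sketch.
        return 1.0
    return (p_o - p_e) / (1.0 - p_e)


# Hypothetical per-time-step activity labels from two annotators.
annotator_a = ["walk", "walk", "walk", "run", "run",
               "rest", "rest", "rest", "rest", "walk"]
annotator_b = ["walk", "walk", "run", "run", "run",
               "rest", "rest", "rest", "walk", "walk"]

p_o = observed_agreement(annotator_a, annotator_b)
kappa = cohens_kappa(annotator_a, annotator_b)
print(f"observed agreement = {p_o:.2f}, Cohen's kappa = {kappa:.3f}")
```

Here the annotators agree on 8 of 10 time steps, but kappa is lower than the raw agreement because part of that agreement is expected by chance. Two annotators who systematically disagree, for example `[0, 1]` against `[1, 0]`, receive a negative kappa.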

A computational psycholinguistic evaluation of the syntactic abilities of Galician BERT models at the interface of dependency resolution and training time

Finally, to observe the effects of the training process, we compare the different degrees of achievement of two monolingual BERT models at different training points. We also release their checkpoints and propose an alternative evaluation metric. Our results confirm previous findings by similar works that use the agreement prediction task and provide interesting insights into the number of training steps required by a Transformer model to solve long-distance dependencies.

Published By:

Iria de-Dios-Flores

2022

Inter-annotator agreement in spoken language annotation: Applying uα-family coefficients to discourse segmentation

The Valencia Español Coloquial model applies discourse segmentation to spoken language. This study measures inter-annotator agreement for identifying and labeling units in a conversation. Three experts segmented a conversation into subacts, the minimal unit in the model, and labeled the units with 10 categories. Statistical metrics showed high agreement, reaching 0.8 for procedure subacts. The model is validated for full pragmatic analysis of conversation.

Published By:

Salvador Pons Bordería

2021

A comparative evaluation of streamflow prediction using the SWAT and NNAR models in the Meenachil River Basin of Central Kerala, India.

Reliable and accurate modelling of streamflow is still challenging due to complexity, data requirements and inaccuracy. A neural network autoregression (NNAR) model was evaluated as a replacement for the Soil and Water Assessment Tool (SWAT) model for data-scarce and immediate streamflow prediction. The NNAR model took lagged streamflow values as inputs and output next-day predictions. Using 20-day windowed data, NNAR produced the best predictions. Evaluation metrics (R = 0.90, RMSE = 28.27, MAE = 11.92, R2 = 0.83) showed agreement between predicted and observed streamflow. NNAR accurately predicted streamflow without modelling the physical processes governing the system.

Published By:

M. S. Saranya - Water Science and Technology

2023

Promoting Reproducible Research for Characterizing Nonmedical Use of Medications Through Data Annotation: Description of a Twitter Corpus and Guidelines

Some abused prescription drugs are mentioned frequently on Twitter. Machine learning classifiers can identify drug mentions with 73% accuracy.

Published By:

K. O’Connor - Journal of Medical Internet Research

2020

Deep Neural Networks Can Accurately Detect Blood Loss and Hemorrhage Control Task Success From Video

Deep neural networks predicted blood loss and task outcomes from video of simulated surgery with some success. Providing additional instrument data improved prediction accuracy.

Published By:

G. Kugener - Neurosurgery

2022

Retrieval-based Annotation of Multi-channel Time-Series Data for HAR

We propose a semi-automated annotation procedure employing ranked hypotheses from a deep architecture, decreasing annotation time and maintaining consistency.

Published By:

Erik Altermann

2022

Promoting Reproducible Research for Characterizing Nonmedical Use of Medications Through Data Annotation: Description of a Twitter Corpus and Guidelines (Preprint)

An annotated dataset was created from tweets mentioning abuse-prone medications. The annotation had high agreement (IAA = 0.86), classifying tweets as abuse/misuse, personal use, mention, or unrelated.

Published By:

K. O’Connor

2019

Variational Autoencoders for Biomedical Signal Morphology Clustering and Noise Detection.

A model-free approach detects abnormal waveform noise without annotation. It improves blood pressure estimation from wearable bio-impedance data.

Published By:

Z. Nowroozilarki - IEEE Journal of Biomedical and Health Informatics

2023
