In this technical note we demonstrate how to conduct model-based statistical hypothesis testing with the Python library pymer4. The code allows you to replicate the reanalysis of the Kreutzer et al. dataset presented in the significance chapter of the book.
In this study the researchers wanted to improve the performance of a pretrained machine translation system by incorporating human translation quality judgments via a reinforcement learning mechanism. They investigated two modes of feedback. The first mode is called "Marking". In this mode the annotators simply mark words that are wrong. The feedback is incorporated into the objective function such that the objective is maximized when the probability of the correct (non-marked) tokens of the translated sequence is increased and/or the probability of the incorrect (marked) tokens is decreased. The second mode is called "Post Edit". In this mode the annotators correct the translations. Each corrected translation is then used as a new gold-standard translation and the system is trained on these new gold standards. The quality of a machine-translated sentence was obtained by calculating TER, BLEU and METEOR scores relative to the original gold-standard translation. To avoid redundancy we limit our showcase to the TER evaluation.
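For reference (this definition is standard for TER but not part of the original study description): TER is the minimal number of edit operations (insertions, deletions, substitutions and shifts) needed to turn the system output into the reference translation, normalized by the reference length, i.e. $\text{TER} = \frac{\#\text{edit operations}}{\#\text{reference tokens}}$, so smaller values indicate better translations.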
The obvious question is: "Which feedback improves the baseline system the most?". In order to answer this question Kreutzer et al. applied the baseline system and the models trained on marking and post-edit feedback to all sentences of the held-out set (n = 1041).
Remark: For the purpose of analyzing evaluation data we do not use models (in our case LMEMs) as predictive devices (the typical way they are used in machine learning). Instead we view models as descriptions of the (to us at least partially unknown) random process that generates the data and use them to learn properties of this process.
Load the relevant libraries:
# The authors of pymer4 recommend adding the following two lines when pymer4 is run inside a Jupyter notebook.
import os
os.environ['KMP_DUPLICATE_LIB_OK'] = 'True'
import numpy as np
import pandas as pd
from pymer4.models import Lmer # just import the linear mixed models class
import scipy.stats as stats
In order to fit a linear mixed effects model with Lmer we need to store the evaluation data in a pandas dataFrame.
We are going to load a csv file that stores all the evaluation data of the experiment and adjust the data type for categorical variables:
eval_data = pd.read_csv('../significance_all_hyperpar/data_ter.csv')
eval_data = eval_data.astype({"sentence_id" : 'category', "system" : 'category'})
Let's take a quick look at how a proper evaluation dataFrame should look:
eval_data
|      | sentence_id | system   | replication | ter    | src_length |
|------|-------------|----------|-------------|--------|------------|
| 0    | 0_0         | Baseline | 1           | 0.4375 | 29         |
| 1    | 0_0         | Marking  | 1           | 0.4375 | 29         |
| 2    | 0_0         | Marking  | 2           | 0.4375 | 29         |
| 3    | 0_0         | Marking  | 3           | 0.4375 | 29         |
| 4    | 0_0         | PostEdit | 1           | 0.4375 | 29         |
| ...  | ...         | ...      | ...         | ...    | ...        |
| 7282 | 9_9         | Marking  | 2           | 0.6000 | 16         |
| 7283 | 9_9         | Marking  | 3           | 0.6000 | 16         |
| 7284 | 9_9         | PostEdit | 1           | 0.6000 | 16         |
| 7285 | 9_9         | PostEdit | 2           | 0.6000 | 16         |
| 7286 | 9_9         | PostEdit | 3           | 0.6000 | 16         |
7287 rows × 5 columns
The essential information is contained in the following columns: sentence_id identifies the source sentence, system identifies the system that produced the translation, and ter stores the TER score of the corresponding translation.
We can also see that the dataFrame stores the evaluation data of three replications for each system. These replications were generated by Kreutzer et al. by training the Marking and PostEdit models with three different initial random seeds while keeping all other hyperparameters equal. The identifier for the random seed is stored in the replication column. This column is only kept for the sake of completeness: as you will see, the variable replication will not play an explicit role in the linear mixed model, but all evaluations are nevertheless used to fit it. The last column contained in the dataFrame is src_length, which stores the length (in number of tokens) of the input sequence.
Remark: It is not necessary to train (or include data from) replications of a model to conduct a model-based significance test. The procedure can also be applied to compare only the best model instances found. It can equally be applied (in exactly the same way) to compare sets of model instances trained under different hyperparameter configurations (even if the models have different hyperparameters). You control what is compared by keeping the evaluation results of the models you want to compare in the dataFrame and removing the results of those you don't, as sketched below.
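As a purely hypothetical illustration (this particular filtering is not part of the original analysis), restricting the comparison to the Baseline and the Marking system amounts to subsetting eval_data before fitting:
# Hypothetical example: keep only the rows of the Baseline and the Marking system.
subset = eval_data[eval_data["system"].isin(["Baseline", "Marking"])].copy()
# Drop the now unused "PostEdit" category level so it does not enter later model fits.
subset["system"] = subset["system"].cat.remove_unused_categories()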
As mentioned in the introduction we have three competing systems (Baseline, Marking, PostEdit) which we want to compare. The first question we would like to answer is: "Is there any statistically significant difference between them?". To apply the GLRT (Algorithm 4.4, page 106) we need to fit two LMEMs to the data. We use pymer4 to instantiate and fit these models.
We instantiate a LMEM by calling Lmer with two essential arguments:
formula expects a string representing an abstract description of the model we want to instantiate. The description provided by this string is used to construct the design matrices of the LMEM. The general form of this string for our use case is "evaluation_metric ~ system_identifier + (1 | input_identifier)".
data expects the dataFrame that stores the evaluations we want to analyze.
differentMeans_model = Lmer(formula = "ter ~ system + (1 | sentence_id)", data = eval_data)
Now that we have created an instance we need to fit the model to the data. We do this by calling the fit() method.
fit() has several arguments, but we use only three:
factors allows us to specify the level names of the system_identifier. In our case system has three levels ("Baseline", "Marking" and "PostEdit").
REML allows us to choose between the REML estimator (True) and the ML estimator (False). Because we want to conduct a GLRT we have to use the ML estimator, so we set REML to False.
summarize: We set it to False because we want fit to run silently without reporting the final results.
differentMeans_model.fit(factors = {"system" : ["Baseline", "Marking", "PostEdit"]}, REML = False, summarize = False)
The inclusion of system_identifier in the model description reflects the hypothesis that the expected TER scores differ for at least two systems. In order to test this hypothesis we need to compare differentMeans_model with a model that reflects the assumption that the expected TER scores are equal for all systems. To fit such a model we repeat the above steps, but this time the formula describing the model does not contain the system_identifier:
commonMean_model = Lmer(formula = "ter ~ (1 | sentence_id)", data = eval_data)
commonMean_model.fit(REML = False, summarize = False)
Next we define a function that does the calculations needed to perform a GLRT.
def GLRT(mod1, mod2):
    # Test statistic: twice the absolute difference of the models' log-likelihoods.
    chi_square = 2 * abs(mod1.logLike - mod2.logLike)
    # Degrees of freedom: difference in the number of estimated fixed-effect coefficients.
    delta_params = abs(len(mod1.coefs) - len(mod2.coefs))
    # p-value from the chi-square distribution with delta_params degrees of freedom.
    return {"chi_square" : chi_square, "df" : delta_params, "p" : 1 - stats.chi2.cdf(chi_square, df=delta_params)}
Then we perform the test:
GLRT(differentMeans_model, commonMean_model)
{'chi_square': 34.08951047763003, 'df': 2, 'p': 3.958738858944599e-08}
The significant p-value (smaller than the alpha level of 0.05) lets us conclude that the differentMeans_model fits the data better than the commonMean_model. Consequently, we reject the hypothesis that all three systems perform equally well.
In the next step we refine our analysis to make a more specific statement about the performance differences between the systems. To this end we use so-called contrasts (in our case pairwise comparisons) and perform a so-called post-hoc analysis. pymer4 implements the relevant techniques in the post_hoc() method.
post_hoc() has several arguments (see the pymer4 documentation for the full list), but currently we only need marginal_vars, which lists the factors for which we want to know the level means. In our case we want to know the estimates of the expected TER score for "Baseline", "Marking" and "PostEdit" (the levels of system).
post_hoc_results = differentMeans_model.post_hoc(marginal_vars = ["system"])
P-values adjusted by tukey method for family of 3 estimates
post_hoc() returns a list of two dataFrames.
The first stores cell mean estimates (plus confidence intervals and standard errors):
post_hoc_results[0] #cell (group) means
|   | system   | Estimate | 2.5_ci | 97.5_ci | SE    | DF       |
|---|----------|----------|--------|---------|-------|----------|
| 0 | Baseline | 0.591    | 0.575  | 0.607   | 0.008 | 1139.270 |
| 1 | Marking  | 0.578    | 0.563  | 0.594   | 0.008 | 1062.523 |
| 2 | PostEdit | 0.581    | 0.566  | 0.597   | 0.008 | 1062.523 |
The second one stores all pairwise comparisons (including p-values) of these level means:
post_hoc_results[1] #contrasts (group differences)
|   | Contrast            | Estimate | 2.5_ci | 97.5_ci | SE    | DF     | T-stat | P-val | Sig |
|---|---------------------|----------|--------|---------|-------|--------|--------|-------|-----|
| 0 | Baseline - Marking  | 0.012    | 0.007  | 0.017   | 0.002 | 6246.0 | 5.841  | 0.000 | *** |
| 1 | Baseline - PostEdit | 0.010    | 0.005  | 0.015   | 0.002 | 6246.0 | 4.553  | 0.000 | *** |
| 2 | Marking - PostEdit  | -0.003   | -0.006 | 0.001   | 0.001 | 6246.0 | -1.822 | 0.163 |     |
Row 0 records the statistics of the comparison between the Baseline and the Marking system, row 1 those of Baseline vs. PostEdit, and row 2 those of Marking vs. PostEdit. The column Estimate records the estimated mean difference between the compared systems and P-val stores the corresponding p-value.
In summary, we see that both feedback procedures improved the baseline system, but neither of them seems to be more effective than the other.
We start by looking at a panel of descriptive plots that show the dependency of the estimated expected TER score on the source sentence length for each system. To this end we create a scatterplot for each system and add a non-parametric smoother to get an idea of the expected TER score conditional on the source sentence length $\mathbb{E}[\text{TER}|\text{source sentence length}]$ for each system. Such a plot can easily be generated with the visualization package of your choice based on eval_data, for example as sketched below.
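One possible way to generate such a panel (a minimal sketch assuming seaborn and matplotlib; neither the package nor the styling is prescribed by the original analysis):
import seaborn as sns
import matplotlib.pyplot as plt

# One scatterplot per system; lowess=True adds a non-parametric LOWESS smoother
# (this option requires the statsmodels package to be installed).
sns.lmplot(data = eval_data, x = "src_length", y = "ter", col = "system",
           lowess = True, scatter_kws = {"alpha": 0.2})
plt.show()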
Looking at the non-parametric smoothers (the lines) we can see a qualitatively similar non-linear dependency of the expected TER score on the source sentence length $\mathbb{E}[\text{TER}|\text{source sentence length}]$ for all systems. In general, the automatic translation of longer sentences yields worse results (for TER, smaller is better) than that of shorter sentences. But a closer look shows that the steepness of this dependency differs between the systems.
In order to analyze this (so far only descriptive) difference we bin the source sentence length into three classes (short, typical and very long) for further assessment:
eval_data = eval_data.assign(
    # Bin the source sentence length into three classes:
    # short (<= 15 tokens), typical (16-55 tokens) and very long (> 55 tokens).
    src_length_class = lambda x: pd.cut(
        x.src_length,
        bins = [np.min(x.src_length), 15, 55, np.max(x.src_length)],
        labels = ["short", "typical", "very long"],
        include_lowest = True
    )
)
differentMeans_model is a model for $\mathbb{E}[\text{TER}|\text{system}]$. We can simply expand its specification by additionally conditioning on the source sentence length in order to analyze whether $\mathbb{E}[\text{TER}|\text{source sentence length}]$ is uniform across all systems or not.
We do this by adding the terms src_length_class and system:src_length_class to the formula of differentMeans_model, thereby creating a model for $\mathbb{E}[\text{TER}|\text{source sentence length, system}]$:
model_expanded = Lmer("ter ~ system + src_length_class + system:src_length_class + (1 | sentence_id)", data = eval_data)
model_expanded.fit(factors = {"system" : ["Baseline", "Marking", "PostEdit"], "src_length_class" : ["short", "typical", "very long"]}, REML = False, summarize=False)
system:src_length_class is a so-called interaction term. This term allows model_expanded to model a different $\mathbb{E}[\text{TER}|\text{source sentence length}]$ for each system.
We want to test the hypothesis that $\mathbb{E}[\text{TER}|\text{source sentence length}]$ differs for at least two systems, so we have to compare model_expanded with a model that represents the complementary hypothesis that $\mathbb{E}[\text{TER}|\text{source sentence length}]$ is identical for all systems. The corresponding model can be obtained by removing the interaction term from the model description of model_expanded:
model_nointeraction = Lmer("ter ~ system + src_length_class + (1 | sentence_id)", data = eval_data)
model_nointeraction.fit(factors = {"system" : ["Baseline", "Marking", "PostEdit"], "src_length_class" : ["short", "typical", "very long"]}, REML = False, summarize=False)
Next we perform a GLRT to obtain a p-value:
GLRT(model_expanded, model_nointeraction) # test interaction
{'chi_square': 32.43479533648133, 'df': 4, 'p': 1.558982328231373e-06}
Again the p-value is less than 0.05 and thus we conclude that $\mathbb{E}[\text{TER}|\text{source sentence length}]$ is not equal for all systems.
The easiest way to interpret a significant interaction is to plot the involved means and standard errors. These can be calculated by calling post_hoc() on the expanded model with the additional grouping_vars argument, which specifies the variable used to group the means.
post_hoc_results = model_expanded.post_hoc(marginal_vars = "system", grouping_vars = "src_length_class")
P-values adjusted by tukey method for family of 3 estimates
post_hoc_results[0]
|   | system   | src_length_class | Estimate | 2.5_ci | 97.5_ci | SE    | DF       |
|---|----------|------------------|----------|--------|---------|-------|----------|
| 0 | Baseline | short            | 0.554    | 0.531  | 0.577   | 0.012 | 1141.571 |
| 1 | Marking  | short            | 0.538    | 0.515  | 0.561   | 0.012 | 1063.020 |
| 2 | PostEdit | short            | 0.541    | 0.519  | 0.564   | 0.012 | 1063.020 |
| 3 | Baseline | typical          | 0.614    | 0.593  | 0.635   | 0.011 | 1141.571 |
| 4 | Marking  | typical          | 0.606    | 0.585  | 0.627   | 0.011 | 1063.020 |
| 5 | PostEdit | typical          | 0.609    | 0.588  | 0.630   | 0.011 | 1063.020 |
| 6 | Baseline | very long        | 0.891    | 0.752  | 1.031   | 0.071 | 1141.571 |
| 7 | Marking  | very long        | 0.818    | 0.681  | 0.955   | 0.070 | 1063.020 |
| 8 | PostEdit | very long        | 0.783    | 0.646  | 0.921   | 0.070 | 1063.020 |
Using the visualization package of your choice you can create an interaction plot by plotting the estimated means, for example as in the sketch below.
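A minimal sketch of such an interaction plot, assuming matplotlib (the original text leaves the choice of package open); the SE column of the post-hoc table is used for the error bars:
import matplotlib.pyplot as plt

cell_means = post_hoc_results[0]
# One line per system: estimated expected TER (error bars = one standard error)
# for each source sentence length class.
for system_name, system_means in cell_means.groupby("system"):
    plt.errorbar(system_means["src_length_class"].astype(str), system_means["Estimate"],
                 yerr = system_means["SE"], marker = "o", capsize = 3, label = system_name)
plt.xlabel("source sentence length class")
plt.ylabel("estimated expected TER")
plt.legend()
plt.show()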
Obviously, for all systems the translation of shorter sentences works better in general than that of longer ones. But most interestingly, we can see that the performance difference between the systems depends on the source sentence length. We see only very small gains (for TER, smaller is better) for sentences of typical length, and larger but still small gains for short ones. The largest gains can be observed for very long sentences. Additionally, for short sentences and sentences of typical length the Marking-improved system seems to work best, but for very long sentences the PostEdit-augmented system seems to outperform all competitors. These observations are only descriptive, but we can employ the machinery of the previous section to conduct statistical inference.
In order to assess the pairwise differences conditional on the sentence length we look at the second output table of the post_hoc() call. For ease of presentation we slice the table so that we only see the rows that belong to a particular source sentence length class:
post_hoc_results[1].query("src_length_class == 'short'")
|   | Contrast            | src_length_class | Estimate | 2.5_ci | 97.5_ci | SE    | DF     | T-stat | P-val | Sig |
|---|---------------------|------------------|----------|--------|---------|-------|--------|--------|-------|-----|
| 0 | Baseline - Marking  | short            | 0.016    | 0.009  | 0.024   | 0.003 | 6246.0 | 5.113  | 0.000 | *** |
| 1 | Baseline - PostEdit | short            | 0.013    | 0.005  | 0.020   | 0.003 | 6246.0 | 4.018  | 0.000 | *** |
| 2 | Marking - PostEdit  | short            | -0.003   | -0.009 | 0.002   | 0.002 | 6246.0 | -1.549 | 0.268 |     |
For short sentences both feedback-finetuned systems show significant improvements over the baseline model, but the performance difference between the finetuned systems is not significant.
post_hoc_results[1].query("src_length_class == 'typical'")
|   | Contrast            | src_length_class | Estimate | 2.5_ci | 97.5_ci | SE    | DF     | T-stat | P-val | Sig |
|---|---------------------|------------------|----------|--------|---------|-------|--------|--------|-------|-----|
| 3 | Baseline - Marking  | typical          | 0.008    | 0.001  | 0.015   | 0.003 | 6246.0 | 2.729  | 0.018 | *   |
| 4 | Baseline - PostEdit | typical          | 0.005    | -0.002 | 0.012   | 0.003 | 6246.0 | 1.689  | 0.210 |     |
| 5 | Marking - PostEdit  | typical          | -0.003   | -0.008 | 0.002   | 0.002 | 6246.0 | -1.471 | 0.305 |     |
For sentences of typical length we only have sufficiently strong and clear empirical evidence to conclude that the marking-based feedback improves the baseline model. All other pairwise comparisons are not significant, which means that the observed empirical evidence is not strong and clear enough to rule out a chance result.
post_hoc_results[1].query("src_length_class == 'very long'")
|   | Contrast            | src_length_class | Estimate | 2.5_ci | 97.5_ci | SE    | DF     | T-stat | P-val | Sig |
|---|---------------------|------------------|----------|--------|---------|-------|--------|--------|-------|-----|
| 6 | Baseline - Marking  | very long        | 0.073    | 0.029  | 0.117   | 0.019 | 6246.0 | 3.863  | 0.000 | *** |
| 7 | Baseline - PostEdit | very long        | 0.108    | 0.063  | 0.152   | 0.019 | 6246.0 | 5.703  | 0.000 | *** |
| 8 | Marking - PostEdit  | very long        | 0.035    | 0.003  | 0.066   | 0.013 | 6246.0 | 2.602  | 0.025 | *   |
Compared to short sentences and sentences of typical length, the improvements for very long sentences are roughly an order of magnitude larger. We observe significant improvements over the baseline for both feedback methods and also a significantly larger improvement for post-edit feedback compared to markings.
Contrary to the unconditional analysis, where we found no significant difference between Marking and PostEdit, the conditional analysis has shown that post-edit feedback is advantageous over marking, but only for very long sentences; at the same time, we have not observed clear enough evidence to conclude that post-edit feedback improves the baseline for sentences of typical length. Based on the findings of this analysis we can make a nuanced recommendation on when to use markings or post edits to improve a baseline system. If the data that a system will process is mainly composed of short sentences or sentences of typical length, the machine learning practitioner should use markings to improve the baseline system. If, on the other hand, the data contains a large fraction of very long sentences, or if the practitioner especially wants to improve the translation of very long sentences, then the baseline model needs to be fine-tuned based on post edits.