Propagate necessary weights for estimating MC closure #27

Open
ktht opened this issue Sep 29, 2022 · 13 comments

@ktht
Member

ktht commented Sep 29, 2022

At the moment we include the FR weight in the nominal event weight. However, when estimating the MC contribution in the fake closure regions, leptons of one flavor are kept tight while leptons of the other flavor are relaxed to fakeable but not tight, which means that the FR weights have to be applied only to those events that conform to the latter case.

I believe the easiest way to implement this is to factor out, at the analysis level, the FR weights that correspond to the flavor of the tight leptons. In other words, we need two additional branches: FR weights for electrons (eg frWeight_e) and FR weights for muons (frWeight_m). When we estimate the MC closure contribution for electrons and muons, we just divide the nominal event weight by the FR weight of muons and electrons, respectively, and fill the histograms.

@saswatinandan
Collaborator

The Fake_Rate weight is already considered in the evtweight calculation here. Here, all leptons are looped over, and the fake weight is estimated only for those leptons that fail the tight selection; for those passing the tight selection it is 1. So I don't think we need any additional branch for the Fake_Rate weight.

@ktht
Member Author

ktht commented Oct 11, 2022 via email

@saswatinandan
Collaborator

Suppose we are in the electron-muon channel and want to derive the electron closure shape correction. Then we consider only those events that have an electron that is fakeable but not tight and a tight muon. We can check this from the isTight and isFakeable branches, and the lepton flavour can be checked from pdgId. The fake weights are obtained from here, where the fake weight is already considered. Isn't that sufficient, or is something missing?

@ktht
Member Author

ktht commented Oct 11, 2022 via email

@veelken
Contributor

veelken commented Oct 11, 2022

Hi,

I think Saswati's idea to handle the MC closure via custom event selection strings is a good idea.
I think we actually need custom event selection strings anyway, Karl.

If I recall correctly, we compute the fake MC closure systematics separately for electrons (Clos_e) and muons (Clos_m). When we compute the Clos_e systematic, we relax the electron selection to fakeable, while keeping the muon selection tight, and apply the FR weights to the fakeable electron. So, I believe we anyway need a custom event selection string to relax the lepton selection from tight to fakeable only for electrons and not for muons. The FR weights are already correctly included in the evtWeight (except that we need to switch from FR measured in data to FR obtained in MC). A similar reasoning applies for computing the Clos_m systematic.

Or am I missing something ?

@ktht
Member Author

ktht commented Oct 11, 2022 via email

@veelken
Contributor

veelken commented Oct 11, 2022

Hi,

I have an alternative proposal. We could add 2 flags:

  • isClos_e = isTight || (is_electron && isFakeable)
  • isClos_mu = isTight || (is_muon && isFakeable)

to the RecoLepton class and 2 corresponding branches to the "plain" Ntuple (this would be an addition to the C++ code, which would work in the same way for all channels, as far as I can see).

These extra branches would allow us to simply

  • replace "lep1_isTight" -> "lep1_isClos_e" and "lep2_isTight" -> "lep2_isClos_e"
  • replace "ntightlep = 2" -> "ntightlep <= 1"

in the event selection string when running the analysis code to compute the Clos_e systematics.
And to make a similar replacement when running on the Clos_m systematic.

I think this would work (and would require the most minimal amount of coding to replace the event selection string).
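
To make the proposal concrete, here is a minimal Python sketch of the two per-lepton flags and of the string replacement. It is only an illustration: the real flags would live in the C++ RecoLepton class, the `Lepton` container and both helper functions are hypothetical, and `is_electron` / `is_muon` are assumed to be equivalent to `abs(pdgId) == 11` / `abs(pdgId) == 13`.

```python
import re
from dataclasses import dataclass


@dataclass
class Lepton:
    """Hypothetical stand-in for the per-lepton quantities stored in the Ntuple."""
    pdg_id: int
    is_fakeable: bool
    is_tight: bool


def is_clos(lep, flavor):
    """isClos_e / isClos_mu as proposed: tight, or fakeable lepton of the relaxed flavor."""
    abs_pdg_id = {"e": 11, "mu": 13}[flavor]  # assumption: flavor is taken from |pdgId|
    return lep.is_tight or (abs(lep.pdg_id) == abs_pdg_id and lep.is_fakeable)


def relax_selection(selection, flavor):
    """Replace every lepN_isTight flag by lepN_isClos_<flavor> in a selection string."""
    return re.sub(r"\b(lep\d+)_isTight\b", rf"\1_isClos_{flavor}", selection)


print(relax_selection("lep1_isTight && lep2_isTight", "e"))
```

The sketch covers only the flag replacement; the companion change to the tight-lepton counter is left out on purpose.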

What do you think ?

@saswatinandan
Collaborator

AFAICT you're both correct. I did not consider the case where the event has both fake electrons and muons but neither of which are tight -- such events end up in the fake AR but not in MC closure -- so the selection string has to differ wrt the fake AR. If that weren't the case, then it'd have been easier to just save the extra weights imo. Thus, we need a python function that generates:

  • "(lep1_isTight && lep2_isTight && ... && lepN_isTight)" for the SR;
  • "(lep1_isFake && lep2_isFake && ... && lepN_isFake) && ! (lep1_isTight && lep2_isTight && ... && lepN_isTight)" for the fake AR;
  • Permutations of "(lepA_isFake && ! lepA_isTight && (lepA_pdgId == 11 || lepA_pdgId == -11))" and "lepB_isTight && (lepB_pdgId == 13 || lepB_pdgId == -13)" to get MC closure for electrons (and swap 11 and 13 to get MC closure for muons);

when creating cfg files for analysis jobs, and apply the nominal event weight in all cases. It could take the number of leptons and the type of analysis region (SR, fake AR, MC closure for e/mu) as input and return one of the strings given above. For taus we can have a separate function and ignore the PDG ID info. I think it can be implemented in our job distribution framework (right?). Or what do you think?
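
A rough sketch of such a function could look as follows. The function name, the region labels ("SR", "fakeAR", "closure_e", "closure_m") and the way the "permutations" are generalized to N leptons are all assumptions made for illustration: every lepton is either relaxed (fakeable, not tight, of the closure flavor) or tight and of the other flavor, with at least one relaxed lepton. This is one possible reading of the description, not an agreed-upon definition.

```python
from itertools import product

PDG_ID = {"e": 11, "m": 13}


def _all(flag, n_leptons):
    """Return "(lep1_<flag> && ... && lepN_<flag>)"."""
    return "(" + " && ".join(f"lep{i}_{flag}" for i in range(1, n_leptons + 1)) + ")"


def _is_flavor(i, pdg_id):
    return f"(lep{i}_pdgId == {pdg_id} || lep{i}_pdgId == -{pdg_id})"


def selection_string(n_leptons, region):
    """Build the event selection string for the given number of leptons and region."""
    if n_leptons < 1:
        raise ValueError("need at least one lepton")
    tight = _all("isTight", n_leptons)
    if region == "SR":
        return tight
    if region == "fakeAR":
        return f"{_all('isFake', n_leptons)} && ! {tight}"
    if region in ("closure_e", "closure_m"):
        relaxed = PDG_ID[region[-1]]
        other = PDG_ID["m" if region[-1] == "e" else "e"]
        terms = []
        # Enumerate which leptons are relaxed; skip the all-tight assignment.
        for roles in product((True, False), repeat=n_leptons):
            if not any(roles):
                continue
            parts = []
            for i, is_relaxed in enumerate(roles, start=1):
                if is_relaxed:
                    parts.append(f"lep{i}_isFake && ! lep{i}_isTight && {_is_flavor(i, relaxed)}")
                else:
                    parts.append(f"lep{i}_isTight && {_is_flavor(i, other)}")
            terms.append("(" + " && ".join(parts) + ")")
        return " || ".join(terms)
    raise ValueError(f"unknown region: {region}")


print(selection_string(2, "fakeAR"))
```

The closure strings grow as 2^N - 1 terms, which is acceptable for the lepton multiplicities we deal with.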

I guess that for the application region at least one lepton should be fakeable and not tight, and at most all leptons can be fakeable but not tight, i.e.

  • "(lep1_isFake && lep2_isFake && ... && lepN_isFake) && ! (lep1_isTight && lep2_isTight && ... && lepN_isTight)" for the fake AR; ->
    "(!lep1_isTight && lep2_isTight)||(lep1_isTight && !lep2_isTight) || (!lep1_isTight && !lep2_isTight)" AR.

@ktht
Member Author

ktht commented Oct 11, 2022

It's the same thing.
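
For the two-lepton case this can be checked by brute force over all tight/not-tight combinations, assuming that both leptons always pass the fakeable selection. The function names below are made up for this check.

```python
from itertools import product


def ar_explicit(tight1, tight2):
    """(!lep1_isTight && lep2_isTight) || (lep1_isTight && !lep2_isTight) || (!lep1_isTight && !lep2_isTight)"""
    return (not tight1 and tight2) or (tight1 and not tight2) or (not tight1 and not tight2)


def ar_compact(tight1, tight2):
    """! (lep1_isTight && lep2_isTight)"""
    return not (tight1 and tight2)


# Compare the two expressions on all four combinations of tightness flags.
equivalent = all(
    ar_explicit(t1, t2) == ar_compact(t1, t2)
    for t1, t2 in product((False, True), repeat=2)
)
print(equivalent)
```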

@saswatinandan
Collaborator

Right. Then we actually don't need (lep1_isFake && lep2_isFake && ... && lepN_isFake), since all leptons are fakeable (https://github.com/HEP-KBFI/TallinnNtupleProducer/blob/main/Writers/plugins/RecoLeptonWriter.cc#L132), so
"! (lep1_isTight && lep2_isTight && ... && lepN_isTight)" alone is sufficient for the AR.

@ktht
Member Author

ktht commented Oct 12, 2022

Hi Christian, shouldn't it be

  • isClos_e = isTight || (is_electron && isFakeable && ! isTight)
  • isClos_mu = isTight || (is_muon && isFakeable && ! isTight)

?

In any case, the code that produces the selection string would be the same at the end of the day, ie it doesn't matter if we concatenate "isTight" flags with "&&" delimiters to obtain events in the SR, or "isClos_e" / "isTight || (is_electron && isFakeable && ! isTight)" flags to get events in the Clos_e region. It basically comes down to whether we want to keep the number of branches to an absolute minimum or make the selection string slightly more readable. I have no preference here.

Saswati, I see no point in being so nitpicky. If we want to relax the lepton selection for the purpose of training an ML model for example, then we would probably need to save loose leptons instead of fakeable ones.

@veelken
Contributor

veelken commented Oct 12, 2022

Hi Karl,

I think

  • isClos_e = isTight || (is_electron && isFakeable && ! isTight)
  • isClos_mu = isTight || (is_muon && isFakeable && ! isTight)

is wrong. My understanding is that at least one of the leptons must not pass the tight lepton selection (to avoid overlap with the signal region). It is wrong to demand that all leptons fail the tight selection. Think about an event with 2 electrons in the 2lss channel. For this event, the isClos_m systematic requires that both electrons pass the tight selection, while the isClos_e systematic allows for either 2 fakeable or 1 fakeable + 1 tight electron, mimicking the selection that we apply in the fake background control region for the real data.

I realized that I made a mistake too: It is wrong to replace "ntightlep = 2" -> "ntightlep <= 1" for Clos_e as well as for Clos_m.
Instead, we need to:

  • for Clos_e: replace "ntightlep = 2" -> "(ntightlep <= 1 || nelectrons = 0)"
  • for Clos_m: replace "ntightlep = 2" -> "(ntightlep <= 1 || nmuons = 0)"

in order to reproduce the behaviour in
https://github.com/HEP-KBFI/tth-htt/blob/master/bin/analyze_2lss.cc#L1737-L1739

@ktht
Member Author

ktht commented Oct 17, 2022

So I now had more time to focus on this topic. Sorry for the very long post but I don't see any other way than to explicitly spell it all out.

I almost agree with your proposal, Christian, if it weren't for this one caveat that we haven't discussed at all: the gen-matching status of the selected leptons and taus. The second line that you highlighted in your latest reply says that a given event is vetoed in MC closure region for electron/muons if all selected objects are tight but at least one of those objects is a non-prompt electron/muon. Note that the functions countFakeElectrons and countFakeMuons determine the number of electrons and muons that are matched to generator-level jets. Please also note that we only select those events in the MC closure regions that have at least one non-prompt lepton, which is accomplished by considering only those histograms that have "_fake" suffix in their name.

It follows that the proposed selection fails to consider 2lss events as coming from MC closure for electrons if a prompt electron and a non-prompt muon in the event both pass the tight cuts. In the past I've tried to argue for removal of such contributions from the MC closure region (see this thread and the PDF file linked within), but unfortunately it introduced too large a discrepancy between MC closure and fakes MC. Thus, if we want to replicate the same prescription as before, we need to replace the proposed "(ntightlep <= 1 || nmuons = 0)" and "(ntightlep <= 1 || nelectrons = 0)" cuts with other criteria (which I'll show below). Alternatively, we could measure fakeable-to-tight identification efficiencies and then figure out how to reconcile those with the fake factor method, or we could abandon the fake factor method altogether in favor of the full matrix approach that includes both fake rates and identification efficiencies (in low-multiplicity channels where inverting such a matrix is mathematically valid).

For the sake of completeness I'll discuss all possible use-cases we might encounter in our analysis as to how the phase space can be partitioned based on the promptness and tightness criteria of selected leptons and taus. I'll focus only on the leptons in the following to keep things simple (mostly because taus have different gen-matching codes and no pdgId attributes). At the very end you'll find some tables that demonstrate how the partitioning is supposed to work.

First, we have data events, which do not have any generator-level information available for obvious reasons:

  • Signal region (SR) -- select only tight leptons with:
    lep1_isTight && lep2_isTight && .. && lepN_isTight
  • Fakes (constitutes the fakes application region) -- select only fakeable leptons, but avoid overlap with the SR, which is achieved with:
    (lep1_isFakeable && lep2_isFakeable && .. && lepN_isFakeable) &&
    ! (lep1_isTight && lep2_isTight && .. && lepN_isTight)
  • Flips -- same as SR, but with a different charge sum condition.
    For example in 2lSS, replace (lep1_charge*lep2_charge > 0) with (lep1_charge*lep2_charge < 0)

In MC we have several ways of dividing the phase space:

  • SR -- select only tight prompt (gen-match code equal to 1) leptons with:
    (lep1_isTight && lep2_isTight && .. && lepN_isTight) &&
    ((lep1_genMatch & 1) && (lep2_genMatch & 1) && .. && (lepN_genMatch & 1))
    • Note that we should replace the gen-matching condition (genMatch & 1) with (genMatch & 1 || genMatch & 2) here and in the following if we don't care about the flips
    • Also, if we consider matches of reconstructed leptons to generator-level hadronic taus as "prompt", then we should also include (genMatch & 4) as another possibility. It comes down to how the prompt lepton ID was trained and (more importantly) how the fake rates were derived
    • SR from different analyses or channels cannot overlap with each other, all other (irreducible background) regions can overlap with one another or with the SR from a different analysis/channel
  • Fakes MC (FMC) -- proxy to data fakes, but also used in comparison to MC closure in order to derive the corresponding MC closure uncertainties. The analysis region is defined by selecting only those events that have only tight leptons but at least one is considered as non-prompt (ie fake, gen-match code equal to 16):
    (lep1_isTight && lep2_isTight && .. && lepN_isTight) &&
    ((lep1_genMatch & 16) || (lep2_genMatch & 16) || .. || (lepN_genMatch & 16))
    • Note that fakes have the highest priority out of all irreducible backgrounds, which means that the event is still considered as "fake" if there are some flips or conversions but there's at least one fake
    • edit: I've come to the realization that maybe we can ignore this "priority" of irreducible backgrounds that we had implemented in the old FW (out of necessity?), and accept only those events to FMC if all leptons and taus are either prompt or non-prompt (with one non-prompt at minimum). This sounds more correct to me than what we did before because conversions or flips have nothing to do with the fake factor method
  • Flips MC -- proxy to data flips, which is achieved by requiring all selected leptons to pass the tight cuts and to have the correct charge sum at the reconstruction level and therefore enter the SR, but with at least one selected electron having a different charge compared to its generator-level match (gen-match code equal to 2). Extra care is needed to avoid contamination by fakes:
    (lep1_isTight && lep2_isTight && .. && lepN_isTight) &&
    ! ((lep1_genMatch & 1) && (lep2_genMatch & 1) && .. && (lepN_genMatch & 1)) &&
    (((lep1_pdgId == 11 && lep1_genMatch & 2) || lep1_genMatch & 1) || ((lep2_pdgId == 11 && lep2_genMatch & 2) || lep2_genMatch & 1) || .. || ((lepN_pdgId == 11 && lepN_genMatch & 2) || lepN_genMatch & 1))
  • Conversions -- SR events, where at least one of the selected electrons is matched to a generator-level photon. Extra care is needed to avoid contamination by fakes or flips:
    (lep1_isTight && lep2_isTight && .. && lepN_isTight) &&
    ! ((lep1_genMatch & 1) && (lep2_genMatch & 1) && .. && (lepN_genMatch & 1)) &&
    ((lep1_genMatch & 4 || lep1_genMatch & 1) || (lep2_genMatch & 4 || lep2_genMatch & 1) || .. || (lepN_genMatch & 4 || lepN_genMatch & 1))
  • Prompt MC (PMC) that is subtracted from data fakes estimate after both have been reweighted according to the fake factor method:
    (lep1_isFakeable && lep2_isFakeable && .. && lepN_isFakeable) &&
    ! (lep1_isTight && lep2_isTight && .. && lepN_isTight) &&
    ((lep1_genMatch & 1) && (lep2_genMatch & 1) && .. && (lepN_genMatch & 1))
  • MC closure for electrons/muons(/taus) (MCe/MCm(/MCt))
    • Needed to approximate FMC by requiring all of the following four conditions to be true:
      1. All electrons/muons(/taus) must pass fakeable selection criteria:
        (lep1_isFakeable && lep2_isFakeable && .. && lepN_isFakeable)
      2. All other objects of different flavor must pass tight selection criteria:
        ((lep1_isTight || lep1_pdgId == 11/13) && (lep2_isTight || lep2_pdgId == 11/13) && .. && (lepN_isTight || lepN_pdgId == 11/13))
      3. At least one selected object must be non-prompt:
        ((lep1_genMatch & 16) || (lep2_genMatch & 16) || .. || (lepN_genMatch & 16))
      4. If all selected objects pass the tight cuts, then the event is vetoed if at least one of those objects is a non-prompt electron/muon(/tau):
        ! ((lep1_isTight && lep2_isTight && .. && lepN_isTight) &&
        ((lep1_pdgId == 11/13 && lep1_genMatch & 16) || (lep2_pdgId == 11/13 && lep2_genMatch & 16) || .. || (lepN_pdgId == 11/13 && lepN_genMatch & 16)))
    • The first two conditions can be concatenated to a single flag like Christian proposed, but I don't see a simple way of reducing the last two conditions to a binary value that can be attributed to individual leptons
    • The last condition induces overlap with the FMC region

In my opinion, the most efficient way to implement it all would be still at the analysis level. In this case we can drop the redundant isFake, isFlip or isClos_e/m branches, since there is no clear-cut way around creating the long event selection string regardless. Plus, it gives more flexibility for specifying the gen-matching conditions and reduces the Ntuple size. I think the piece of code that does all that can be moved to the framework that generates the config files. It can be generalized such that all these details above are hidden from the end-user.

Another issue is that we have used different (ie QCD-driven) fake rates (FR) in MC regions compared to FMC or data fakes, where we use data-driven FR. If we want to support this feature (which I think is the case because it allows us to encapsulate potential differences in flavor composition between the measurement and application regions), then the easiest way to accomplish it is to:

  • use data-driven FR in the nominal event weight;
  • save data-driven and QCD-driven FR in separate branches;
  • when computing yields for the MC closure, undo data-driven FR and apply QCD-driven FR.
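
In code, the last step amounts to a ratio of the two stored weights. The sketch below assumes that the data-driven FR weight enters the nominal event weight as a plain multiplicative factor; the function and argument names are placeholders, not existing branches.

```python
def mc_closure_weight(evt_weight, fr_weight_data, fr_weight_qcd):
    """Undo the data-driven FR weight and apply the QCD-driven one instead."""
    if fr_weight_data == 0.0:
        # The nominal weight cannot be unfolded if the data-driven factor vanishes.
        raise ValueError("data-driven FR weight is zero, cannot undo it")
    return evt_weight / fr_weight_data * fr_weight_qcd


print(mc_closure_weight(0.5, 0.25, 0.5))
```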

edit: forgot to explain what each letter in the following tables stands for:

  • upper left corner tells the channel by lepton flavor (e for electron, m for muon, t for tau)
  • columns tell which of the leptons pass the fakeable or tight definition, eg FTF means that the 1st and 3rd object pass the fakeable but fail the tight criteria, and the 2nd object passes the tight requirements
  • rows tell the gen-matching status of each object, for example NNP means that the first two objects are non-prompt, while the 3rd one is a prompt object

Single lepton channel:

| e | F   | T        |
|---|-----|----------|
| N | MCe | FMC, MCm |
| P | PMC | SR       |

| m | F   | T        |
|---|-----|----------|
| N | MCm | FMC, MCe |
| P | PMC | SR       |

Dilepton channel:

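| ee | FF  | FT  | TF  | TT       |
|----|-----|-----|-----|----------|
| NN | MCe | MCe | MCe | FMC, MCm |
| NP | MCe | MCe | MCe | FMC, MCm |
| PN | MCe | MCe | MCe | FMC, MCm |
| PP | PMC | PMC | PMC | SR       |

| em | FF  | FT  | TF  | TT       |
|----|-----|-----|-----|----------|
| NN | -   | MCe | MCm | FMC      |
| NP | -   | MCe | MCm | FMC, MCm |
| PN | -   | MCe | MCm | FMC, MCe |
| PP | PMC | PMC | PMC | SR       |

| mm | FF  | FT  | TF  | TT       |
|----|-----|-----|-----|----------|
| NN | MCm | MCm | MCm | FMC, MCe |
| NP | MCm | MCm | MCm | FMC, MCe |
| PN | MCm | MCm | MCm | FMC, MCe |
| PP | PMC | PMC | PMC | SR       |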

2l+1tau channel (assuming that taus are required to be gen-matched in the SR):

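| eet | FFF | FFT | FTF | TFF | FTT | TFT | TTF | TTT           |
|-----|-----|-----|-----|-----|-----|-----|-----|---------------|
| NNN | -   | MCe | -   | -   | MCe | MCe | MCt | FMC, MCm      |
| NNP | -   | MCe | -   | -   | MCe | MCe | MCt | FMC, MCm, MCt |
| NPN | -   | MCe | -   | -   | MCe | MCe | MCt | FMC, MCm      |
| PNN | -   | MCe | -   | -   | MCe | MCe | MCt | FMC, MCm      |
| NPP | -   | MCe | -   | -   | MCe | MCe | MCt | FMC, MCm, MCt |
| PNP | -   | MCe | -   | -   | MCe | MCe | MCt | FMC, MCm, MCt |
| PPN | -   | MCe | -   | -   | MCe | MCe | MCt | FMC, MCm, MCe |
| PPP | PMC | PMC | PMC | PMC | PMC | PMC | PMC | SR            |

| emt | FFF | FFT | FTF | TFF | FTT | TFT | TTF | TTT           |
|-----|-----|-----|-----|-----|-----|-----|-----|---------------|
| NNN | -   | -   | -   | -   | MCe | MCm | MCt | FMC           |
| NNP | -   | -   | -   | -   | MCe | MCm | MCt | FMC, MCt      |
| NPN | -   | -   | -   | -   | MCe | MCm | MCt | FMC, MCm      |
| PNN | -   | -   | -   | -   | MCe | MCm | MCt | FMC, MCe      |
| NPP | -   | -   | -   | -   | MCe | MCm | MCt | FMC, MCm, MCt |
| PNP | -   | -   | -   | -   | MCe | MCm | MCt | FMC, MCe, MCt |
| PPN | -   | -   | -   | -   | MCe | MCm | MCt | FMC, MCe, MCm |
| PPP | PMC | PMC | PMC | PMC | PMC | PMC | PMC | SR            |

| mmt | FFF | FFT | FTF | TFF | FTT | TFT | TTF | TTT           |
|-----|-----|-----|-----|-----|-----|-----|-----|---------------|
| NNN | -   | MCm | -   | -   | MCm | MCm | MCt | FMC, MCe      |
| NNP | -   | MCm | -   | -   | MCm | MCm | MCt | FMC, MCe, MCt |
| NPN | -   | MCm | -   | -   | MCm | MCm | MCt | FMC, MCe      |
| PNN | -   | MCm | -   | -   | MCm | MCm | MCt | FMC, MCe      |
| NPP | -   | MCm | -   | -   | MCm | MCm | MCt | FMC, MCe, MCt |
| PNP | -   | MCm | -   | -   | MCm | MCm | MCt | FMC, MCe, MCt |
| PPN | -   | MCm | -   | -   | MCm | MCm | MCt | FMC, MCe, MCm |
| PPP | PMC | PMC | PMC | PMC | PMC | PMC | PMC | SR            |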

Of course, if taus are not required to be gen-matched (and data/MC SF is applied instead, as was the case with 2lss+1tau and 3l+1tau channels in ttH multilepton analysis), then the phase space partitioning is identical to the dilepton case.
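
The partitioning shown in the tables can be cross-checked with a small classifier that applies the MC closure conditions directly to per-object flags. This is a simplified sketch: the function name and its interface are invented here, every object is assumed to pass the fakeable selection, and each object is treated as either prompt or non-prompt (flips and conversions are ignored).

```python
def analysis_regions(flavors, tight, prompt, closure_flavors=("e", "m")):
    """Return the regions an MC event falls into.

    flavors: one letter per object, eg "em" or "eet"
    tight:   per-object flag, True if the object passes the tight selection
    prompt:  per-object flag, True if the object is gen-matched as prompt
    """
    all_tight = all(tight)
    if all(prompt):
        return ["SR"] if all_tight else ["PMC"]
    # From here on at least one object is non-prompt.
    regions = ["FMC"] if all_tight else []
    for x in closure_flavors:
        # Objects of a different flavor than x must be tight.
        others_tight = all(t or f == x for f, t in zip(flavors, tight))
        # Veto all-tight events that contain a non-prompt object of flavor x.
        vetoed = all_tight and any(f == x and not p for f, p in zip(flavors, prompt))
        if others_tight and not vetoed:
            regions.append("MC" + x)
    return regions


# em channel, electron fakeable-not-tight, muon tight, both non-prompt
print(analysis_regions("em", (False, True), (False, False)))
```

For channels with taus, pass closure_flavors=("e", "m", "t") to also evaluate the MC closure for taus.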
