Trust and Internet Identity Meeting Europe
2013 - 2020: Workshops and Unconference

Identity Correlation and Progressive User Profiles

(Radovan Semancik)

Session board photo

Typical use case: one identity is a user in my database, and a second identity is an Active Directory account. In an enterprise environment they belong to the same person and share the same ID. It is similar when dealing with government systems; the situation should work the same way.

Now imagine a university, or another system that is not strictly based on an employment contract, trying to do the same thing. There is a record of a student at the university, and there is an identity the person is using from another organization. How do I tell whether they belong to the same person or to different people?

Question: How many of you are from the academic community?

?: We have a matching algorithm. If the name is too common and there aren't enough other parameters around it, the record is not matched but sent into a special queue. In one case an employee account came in; they contacted the employee's superior, asked whether he had been a student, and asked for his matriculation number.

R: So a manual process? manual process -> telephone

?: Sometimes it's as easy as looking it up in the student register. There we can match on things like place of birth, which is another data point, but this isn't fed automatically into the system.

R: Do you do something smarter than manual?

?: If the name is unique and long enough. It might only be unique within the university, but it is usually pretty unique in general as well.

It is a bit of a list: a surname like Mueller is common. Uniqueness is a function of length, since long names are uncommon.
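The routing described above (auto-match distinctive names, queue common ones for a human) could look roughly like the following. This is a minimal sketch, not the speaker's algorithm: the `Record` type, the `COMMON_SURNAMES` list, and the length threshold are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical list of surnames too common to match on name alone.
COMMON_SURNAMES = {"mueller", "schmidt", "meier"}
# Assumed threshold: full names shorter than this are not treated as distinctive.
MIN_DISTINCTIVE_LENGTH = 10


@dataclass
class Record:
    given_name: str
    surname: str
    birth_date: Optional[str] = None  # ISO date; often missing


def name_is_distinctive(rec: Record) -> bool:
    """Long names with an uncommon surname are treated as distinctive."""
    full_name = f"{rec.given_name} {rec.surname}".strip()
    uncommon = rec.surname.strip().lower() not in COMMON_SURNAMES
    return uncommon and len(full_name) >= MIN_DISTINCTIVE_LENGTH


def route(incoming: Record, existing: Record) -> str:
    """Return 'match', 'no_match', or 'manual_queue' for a pair of records."""
    same_name = (
        incoming.given_name.lower() == existing.given_name.lower()
        and incoming.surname.lower() == existing.surname.lower()
    )
    if not same_name:
        return "no_match"
    # A second data point, when both sides have it, decides the case.
    if incoming.birth_date and existing.birth_date:
        return "match" if incoming.birth_date == existing.birth_date else "no_match"
    # Name only: trust it if distinctive, otherwise hand over to a human.
    return "match" if name_is_distinctive(incoming) else "manual_queue"
```

With these assumptions, two "Anna Mueller" records without birth dates land in the manual queue, while a long, uncommon name is matched automatically.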

?: You're asking three questions: technical, business, and data quality. If a person has records in multiple systems, what data do those records have in common that I can even look at, for example a national identifier or date of birth? In the US, an international student attending won't have a US student ID. It still reduces to data quality. Each of these is a great topic. The business process question is whether there is a manual process, or someone who says "I have an employee who is a student as well"; it is sometimes delegated. We have some pointers for this. There are two things that we can start from. With a REST API, a calling system says "I have a lot of attributes about this person", and the goal of that transaction is to say whether there is a unique identifier or not.

R: Yes, no, or?

?: Yes, no, maybe. A "no" would be an error message. If I send a bunch of attributes and they match, that is a yes; if the person doesn't exist yet, that's also a yes. I just need to know whether I got an identifier back. The "maybe" is: I ran some heuristic and you might be one of these three people. In those cases, this is how you communicate it: you can either return the candidates or have an out-of-band mechanism that is close to the human process. At that point the request has been queued, and a human comes in and resolves it. You can think of it as a batch process as well as an interactive process: in a batch process it doesn't make sense to respond synchronously, in an interactive process it does.
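The yes / yes-new / maybe outcomes can be sketched as below. This is an illustration only and not the actual match API: `MatchEngine`, `MatchResult`, the status strings, and the name-plus-birth-date rule are all hypothetical.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class MatchResult:
    status: str                          # "matched", "created", or "pending"
    reference_id: Optional[str] = None   # set for "matched" and "created"
    candidates: List[str] = field(default_factory=list)


class MatchEngine:
    def __init__(self) -> None:
        self.people: Dict[str, Dict[str, str]] = {}  # reference id -> attributes
        self.queue: List[Dict[str, str]] = []        # requests awaiting a human

    def match(self, attrs: Dict[str, str], interactive: bool = True) -> MatchResult:
        sure, maybe = [], []
        for ref_id, known in self.people.items():
            if known.get("name") != attrs.get("name"):
                continue
            if attrs.get("birth_date") and known.get("birth_date"):
                if attrs["birth_date"] == known["birth_date"]:
                    sure.append(ref_id)
            else:
                maybe.append(ref_id)  # same name, nothing else to compare
        if len(sure) == 1:
            return MatchResult("matched", sure[0])   # "yes": existing person
        if sure or maybe:
            self.queue.append(attrs)                 # "maybe": a human resolves it
            # Only an interactive caller gets the candidate list back.
            shown = sure + maybe if interactive else []
            return MatchResult("pending", candidates=shown)
        ref_id = uuid.uuid4().hex                    # also "yes": new person
        self.people[ref_id] = dict(attrs)
        return MatchResult("created", ref_id)
```

A batch caller would pass `interactive=False` and pick up the resolved identifier later, once the queued request has been handled.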

That is the concept: building out a match component. The protocol is called CIFER ID Match. The protocol says whether you have a match, and you can configure rules based on the attributes. For example, I do a self-registration, this goes into the match flow, and the match engine says that I am one of these people. If the match turns out to be wrong, you can generalize to two cases: one where the records did not match and a new record was created, and one where they did match and were joined. The operations to fix these would be join and split, and the protocol explains how to do that. From the match engine's point of view the fix is easy: in that case, you give me this reference identifier. The hard part is downstream: if you created an account, how do you split it? Is there a way we can automate that? That seems scary.
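On the match-engine side, join and split amount to re-pointing system records at reference identifiers. A minimal sketch, assuming a hypothetical `ReferenceRegistry` that maps `(system, local_id)` pairs to reference identifiers; the protocol's own operations may look different, and the downstream account handling is deliberately left out.

```python
import uuid
from typing import Dict, Tuple

SystemRecord = Tuple[str, str]  # (system name, local identifier)


class ReferenceRegistry:
    def __init__(self) -> None:
        self.ref_of: Dict[SystemRecord, str] = {}

    def register(self, record: SystemRecord) -> str:
        """Give an unknown record a fresh reference identifier."""
        return self.ref_of.setdefault(record, uuid.uuid4().hex)

    def join(self, keep_ref: str, drop_ref: str) -> None:
        """Two references turned out to be one person: fold drop_ref into keep_ref."""
        for record, ref in list(self.ref_of.items()):
            if ref == drop_ref:
                self.ref_of[record] = keep_ref

    def split(self, record: SystemRecord) -> str:
        """A record was wrongly joined: move it to a new reference identifier."""
        new_ref = uuid.uuid4().hex
        self.ref_of[record] = new_ref
        return new_ref
```

The registry change is cheap; what this sketch does not show is the hard part raised in the session, namely undoing accounts that downstream systems already created from the wrong join.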

R: Self-service merging?

B: If you have the right attribute you can do that: with two Gmail accounts, say, you could allow a self-service merge. It's a bit out of the scope of the API right now; that's the tooling on top of it. We haven't seen a lot of that, particularly in university cases, where you don't have that attribute.

I haven’t seen any real data but it’s different.

R: How do you walk the user through the use-case?

B: It's not much more complicated than asking: did you previously have an ID or not? We can probably trust the answer, or send a reset token to confirm it. There are a couple of implementations of it, not only the open-source component. The idea is to put it in front of your legacy components, switch those out, and when you're ready just point the new stuff to the match component. We've seen some implementations of that, but I am not that sure.

Maybe merge with the SCIM team?

Possibly. Some technical bits make sense and some don't; SCIM is a much larger protocol than what you need.

R: What about manual processes?

B: The goal is to get 99% automated, but there is still a level of manual processing. Father and son with the same name but different birthdays. Or twins with the same name born on the same day. There's a desire not to collect national identifiers anymore.

R: We want to limit the data and the number of identifiers because of GDPR. Later we realize this is the same person; what do we do? Merge them?

B: The concept of the match engine is that the one thing you persist and maintain in the match engine is the reference identifier. In midPoint you don't track the national identifier, only these other identifiers. Then, if you realize later that it is the same person, there's a portion of the API that allows you to handle this.

Rainer: The law specifically prohibits using national identifiers; it doesn't prohibit using derived identifiers. A hash of a prohibited identifier is perfectly legal, and if you need to rematch it is very valuable.
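A derived identifier of this kind could be computed as follows. This is a sketch under assumptions: the session only said "a hash", so the keyed HMAC-SHA256, the normalization step, and the `derived_identifier` name are illustrative choices, and nothing here is a statement about what any particular law permits.

```python
import hashlib
import hmac


def derived_identifier(national_id: str, secret_key: bytes) -> str:
    """Return a stable, non-reversible token derived from a national identifier.

    The raw identifier is normalized first so that formatting differences
    (spaces, dashes, letter case) still produce the same token on rematch.
    """
    normalized = "".join(ch for ch in national_id if ch.isalnum()).upper()
    return hmac.new(secret_key, normalized.encode("utf-8"), hashlib.sha256).hexdigest()
```

Using a key (an assumption, not something the speaker mentioned) makes it harder to brute-force short numeric identifiers than a plain unkeyed hash; only the derived token would be stored and compared.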

How would this use case even occur?

I am trying to explore the idea.

?: But once it's matched, that's it. You wouldn't re-evaluate it if it's in the queue for a quality process?

B: We do the initial match and don't do rematching, but you can imagine scenarios where you would want to: if something is changed manually, maybe it should be rematched.

For the student email, do you give them the same address again? The most common pattern, when a student leaves for a semester and comes back, is to try to relink them to their old data and reactivate it. That works if you have a strong identifier; otherwise it may cause a problem.

Does the campus allow the reuse?

B: About Conway hash: neither the API nor the initial implementation defines any of the attributes. You can hash the national identifiers; the downside is that you lose transposition checking.
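The loss of transposition checking can be seen in a small example. The helper below is a hypothetical typo check, not part of any match API: on raw identifiers it detects two swapped neighbouring digits, but on SHA-256 digests of the same values the similarity is gone.

```python
import hashlib


def is_adjacent_transposition(a: str, b: str) -> bool:
    """True if b equals a with exactly two neighbouring characters swapped."""
    if len(a) != len(b) or a == b:
        return False
    diff = [i for i in range(len(a)) if a[i] != b[i]]
    return (
        len(diff) == 2
        and diff[1] == diff[0] + 1
        and a[diff[0]] == b[diff[1]]
        and a[diff[1]] == b[diff[0]]
    )


def sha256_hex(value: str) -> str:
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


typed_once = "850712"
typed_again = "857012"  # the digits 0 and 7 were swapped on entry

# On raw values the near-match is visible and could be queued for review.
print(is_adjacent_transposition(typed_once, typed_again))
# On hashes the two values simply look unrelated.
print(is_adjacent_transposition(sha256_hex(typed_once), sha256_hex(typed_again)))
```

Any fuzzy comparison of this kind has to happen before hashing, which is exactly the moment when the raw identifier is still present.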

Australians have a university national identifier. A very user-oriented process.

R: The account starts with something really simple: an anonymous account at university A. If this person wants to get a service, I may want more information, such as a full name or email address, which they fill out before getting access to the service. There needs to be a reverse process as well, to shrink the profile: if I no longer need the email, I should be able to remove it from the profile.
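Growing and shrinking a profile could be modelled by recording which service needs each attribute. A minimal sketch; `ProgressiveProfile`, `enroll`, and `leave` are hypothetical names for illustration and do not describe how any particular IDM product implements this.

```python
from typing import Dict, List, Set


class ProgressiveProfile:
    def __init__(self) -> None:
        self.values: Dict[str, str] = {}          # starts empty: anonymous account
        self.needed_by: Dict[str, Set[str]] = {}  # attribute -> services that need it

    def missing_for(self, required: List[str]) -> List[str]:
        """Attributes the user still has to fill in for a service."""
        return [attr for attr in required if attr not in self.values]

    def enroll(self, service: str, supplied: Dict[str, str], required: List[str]) -> None:
        """Grow the profile by exactly what the service requires."""
        missing = [a for a in required if a not in self.values and a not in supplied]
        if missing:
            raise ValueError(f"missing attributes: {missing}")
        for attr in required:
            if attr in supplied:
                self.values[attr] = supplied[attr]
            self.needed_by.setdefault(attr, set()).add(service)

    def leave(self, service: str) -> None:
        """Shrink the profile: drop attributes no remaining service needs."""
        for attr in list(self.needed_by):
            self.needed_by[attr].discard(service)
            if not self.needed_by[attr]:
                del self.needed_by[attr]
                self.values.pop(attr, None)
```

Supplied values that no service requires are never stored, and an attribute disappears once the last service that asked for it is gone.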

Reversing can be tricky because if you need the info at some point again, you need to find a way to get the information back.

We're encrypting passwords, storing the encrypted password in the database and the key in the key store. If anyone attacks the IDM they get both.

Maybe store it on tapes, like some systems do. What this organization does is encrypt everything and keep the key; under GDPR it is legally accepted that if you delete the key, you have deleted the data. Backups are another reason to do that, because you have logs that contain user information. Otherwise, if you want to delete the user data, you have to go through all backups and remove it. That's how you wipe iPhones; it's the same concept.
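The delete-the-key idea can be illustrated with a toy example. This is only a sketch of the principle: a one-time pad from the standard library stands in for a real cipher (a production system would use an authenticated cipher and a proper key management service), and `KeyStore`, `encrypt_record`, and `decrypt_record` are hypothetical names.

```python
import secrets
from typing import Dict


class KeyStore:
    """Holds one key per user, kept apart from the data and its backups."""

    def __init__(self) -> None:
        self._keys: Dict[str, bytes] = {}

    def create(self, user_id: str, length: int) -> bytes:
        # Toy limitation: one record per user, a new key replaces the old one.
        key = secrets.token_bytes(length)
        self._keys[user_id] = key
        return key

    def get(self, user_id: str) -> bytes:
        return self._keys[user_id]  # raises KeyError once the key is shredded

    def shred(self, user_id: str) -> None:
        self._keys.pop(user_id, None)


def xor_bytes(data: bytes, key: bytes) -> bytes:
    return bytes(d ^ k for d, k in zip(data, key))


def encrypt_record(store: KeyStore, user_id: str, plaintext: str) -> bytes:
    data = plaintext.encode("utf-8")
    return xor_bytes(data, store.create(user_id, len(data)))


def decrypt_record(store: KeyStore, user_id: str, ciphertext: bytes) -> str:
    return xor_bytes(ciphertext, store.get(user_id)).decode("utf-8")
```

After `store.shred(user_id)`, every copy of the ciphertext, including those in backups and logs, can no longer be decrypted, without anyone having to touch the backups themselves.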

For every value that is given, accountability is important: you need to know why you have it. Technically, probably nobody does this. Machines might be able to do it. In some situations, the value was requested because you needed some kind of service.