In the previous section we saw a part of the
identities table, which stores the repo identities found, and their relationship with unique identities (
|0ef1c4a||Jesus M. Gonzalez-Barahonaemail@example.com||NULL||git||0ef1c4a|
|3ca4e85||Daniel Izquierdo Cortazarfirstname.lastname@example.org||NULL||git||3ca4e85|
|7ad0031||Alvaro del Castilloemail@example.com||NULL||git||7ad0031|
|9aed245||Alvaro del Castillofirstname.lastname@example.org||NULL||git||9aed245|
It is obvious that there are some repo identities in it that correspond to the same person. In SortingHat, that means they should point to the same unique identity (
uuid). But that is not the case, because in fact we have never told SortingHat they correspond to the same person. That operation is called “merging”.
For example, let’s merge repo identity
4fcec5a (dpose, email@example.com) with
5b358fc (dpose, firstname.lastname@example.org), which I know correspond to the same person:
(gl) $ sortinghat -u user -p XXX -d shdb merge \ 4fcec5a968246d8342e4acfceb9174531c8545c1 5b358fc11019cf2c03ea4c162009e89715e590dd Unique identity 4fcec5a968246d8342e4acfceb9174531c8545c1 merged on 5b358fc11019cf2c03ea4c162009e89715e590dd
Notice that we had to use the complete hashes (in the table above, and in the listing in the previous section, we shortened them just for readability). What we have done is to merge
5b358fc, and the result is:
$ mysql -u user -pXXX -e 'SELECT * FROM identities WHERE uuid LIKE "5b358fc%";' shdb | id | name | email | username | source | uuid | | 4fcec5a | dpose | email@example.com | NULL | git | 5b358fc | | 5b358fc | dpose | firstname.lastname@example.org | NULL | git | 5b358fc |
The query looked for all rows in the
identities table whose
uuid field starts with
5b358fc, and found two: the repo identifier
5b358fc, which was already linked to it, but also
4fcec5a, which we just merged onto it. This way we have learnt that when we merge, we merge a repo identity (the first argument) on a unique identity (the second argument).
We can follow this procedure for other identities that correspond to the same person: (Quan Zhou, email@example.com) and (quan, firstname.lastname@example.org); (Alberto Martín, email@example.com) and (Alberto Martín, firstname.lastname@example.org); and (Alvaro del Castillo, email@example.com) and (Alvaro del Castillo, firstname.lastname@example.org):
(gl) $ sortinghat -u user -p XXX -d shdb merge \ 0cac4ef12631d5b0ef2fa27ef09729b45d7a68c1 11cc0348b60711cdee515286e394c961388230ab Unique identity 0cac4ef12631d5b0ef2fa27ef09729b45d7a68c1 merged on 11cc0348b60711cdee515286e394c961388230ab (gl) $ sortinghat -u user -p XXX -d shdb merge \ 35c0421704928bcbe3a0d9a4de1d79f9590ccaa9 37a8187909592a7b78559399105f6b5404af9e4e Unique identity 35c0421704928bcbe3a0d9a4de1d79f9590ccaa9 merged on 37a8187909592a7b78559399105f6b5404af9e4e (gl) $ sortinghat -u user -p XXX -d shdb merge \ 7ad0031fa2db40a5149f54dfc2ec2a355e9443cd 9aed245d9df109f8d00ca0e656121c3bdde46a2a Unique identity 7ad0031fa2db40a5149f54dfc2ec2a355e9443cd merged on 9aed245d9df109f8d00ca0e656121c3bdde46a2a
Showing information about a unique identity
Now, we can check how SortingHat is storing information about these merged identities, but instead of querying directly the database, we can just use
(gl) $ sortinghat -u user -p XXX -d shdb show \ 11cc0348b60711cdee515286e394c961388230ab unique identity 11cc0348b60711cdee515286e394c961388230ab Profile: * Name: quan * E-Mail: email@example.com * Bot: No * Country: - Identities: 0cac4ef12631d5b0ef2fa27ef09729b45d7a68c1 Quan Zhou firstname.lastname@example.org - git 11cc0348b60711cdee515286e394c961388230ab quan email@example.com - git No enrollments
That is, we have two repo identities for this person, which we’re identifying in the profile as
quan, with email address
firstname.lastname@example.org. Remember that the profile was already produced when each repo identity was added to the database, by creating a unique identity for it, and using the data in the repo identity for the profile.
We merged the repo identity (Quan Zhou, email@example.com) on the unique identity that corresponded to (quan, firstname.lastname@example.org), which had as profile (quan, email@example.com) as well. Therefore, the unique identity after the merge conserves the old profile, in this case (quan, firstname.lastname@example.org). Should we have merged the other way around, we would have considered (Quan Zhou, email@example.com) as the profile, which in this case seems more convenient. So, we have to be careful about how to merge, if we want to conserve the most interesting profiles.
Unfortunately, we cannot redo the merge with the most convenient order:
(gl) $ sortinghat -u user -p XXX -d shdb merge \ 11cc0348b60711cdee515286e394c961388230ab 0cac4ef12631d5b0ef2fa27ef09729b45d7a68c1 Error: 0cac4ef12631d5b0ef2fa27ef09729b45d7a68c1 not found in the registry
0cac4ef is no longer a valid unique identifier: we lost it when we merged the repo identity
0cac4ef, which was the only one pointing to it. So, we cannot merge any repo identifier on it, since it no longer exists.
Later on we will revisit this case, since there are stuff that can be done: breaking the unique identifier into two. But for now, let’s use another approach to solve this problem.
We can just modify the profile for the unique identity, thus changing the profile for a person:
(gl) $ sortinghat -u user -p XXX -d shdb profile \ --name "Quan Zhou" --email "firstname.lastname@example.org" \ 11cc0348b60711cdee515286e394c961388230ab unique identity 11cc0348b60711cdee515286e394c961388230ab Profile: * Name: Quan Zhou * E-Mail: email@example.com * Bot: No * Country: -
This way we have the name and email address we want. Using
--country we can also set a country for the person.
Regenerating enriched indexes
When we interact with SortingHat, it only changes the contents of the database it manages. Therefore, the changes are not reflected in the indexes we have in ElasticSearch, nor in the dashboard.
To make changes appear in the dashboard, we need to create new enriched indexes (re-enrich the indexes). We can do that by removing raw and enriched indexes from ElasticsSearch, and then running the same
p2o.py commands shown to produce new raw and enriched indexes. But in our case, this is a clear overkill: we don’t need to retrieve new raw indexes from the repositories, since they are fine. We only need to produce new enriched indexes. For that, we can run
p2o.py as follows:
(gl) $ p2o.py --only-enrich --index git_raw --index-enrich git \ -e http://localhost:9200 --no_inc --debug \ --db-host localhost --db-sortinghat shdb --db-user user --db-password XXX \ git https://github.com/grimoirelab/GrimoireELK.git
I only used the
--only-enrich option, instead of
--enrich. This option will not try to retrieve data from the data sources, and then enrich it. It will just use the specified raw index to create the enriched one.
In this case, the command will create a new
git index (by modifying the current one) for items (commits) from the specified repository (
GrimoireELK). We need to run this command for all the repositories with identities affected by the changes. Once we have those new indexes, we’re done: the changes will be reflected in the dashboard.
Regenerating enriched indexes (take two)
The above method, even when it will work, is still an overkill. I really don’t need to modify the whole enriched indexes, by updating all the fields in their items. We just need to update the fields related to identities, which are the only ones that we need to change. For that, we have a specific option to
(gl) $ p2o.py --only-enrich --refresh-identities --index git_raw --index-enrich git \ -e http://localhost:9200 --no_inc --debug \ --db-host localhost --db-sortinghat shdb --db-user user --db-password XXX \ git https://github.com/grimoirelab/GrimoireELK.git
In addition to the
--only-enrich option, which was explained above, I have added the
--refresh-identities, which ensures that only identities, and not all fields in each item, are updated. And this was exactly what we needed. We still need to run the command for all the repositories affected, but that’s all.
Regenerating enriched indexes (take three)
In most cases, when the SortingHat database is modified, only a handful of identities are changed. That means that we don’t really need to change all the identities in enriched indexes. We can avoid this time-consuming operation (when the indexes are large) by changing only the items affected: those with changed identities.
In this case, the command to run is:
(gl) $ p2o.py --only-enrich --refresh-identities --index git_raw --index-enrich git \ --author_uuid 11cc0348b60711cdee515286e394c961388230ab \ 0cac4ef12631d5b0ef2fa27ef09729b45d7a68c1 \ -e http://localhost:9200 --no_inc --debug \ --db-host localhost --db-sortinghat shdb --db-user user --db-password XXX \ git https://github.com/grimoirelab/GrimoireELK.git
The magic is done by the
--author_uuid option, which will ensure that only items with that
author_uuid field are refreshed. In this case, we’re refreshing them for unique identities
0cac4ef, which were changed in the SortingHat database in the process above. Again, this should be done for all the repositories affected.
Something similar can be done with the
--author_id, but in this case for repo identities of the author.