Interlinking Developer Identities Within and Across Open Source Projects: The Data

—Software developers use various software reposi- tories in order to interact with each other or to solve related problems. These repositories provide a rich source of informa- tion for a wide range of tasks. However, one issue to overcome in order to make this information useful is the identiﬁcation and interlinking of multiple identities of developers. In this paper, we propose a Linked Data-based methodology to interlink and integrate multiple identities of a developer found in different software repositories of a project as well as across repositories of multiple projects. By providing such interlinking will enable us to keep track of a developer’s activity not only within a single project but also across multiple projects. The methodology will be presented in general and applied to 5 Apache projects as a case study. Further, we show that the few methods suggested so far are not always appropriate to overcome the developer identiﬁcation problem.


I. INTRODUCTION AND MOTIVATION
In Software Engineering, many tools with underlying repositories have been introduced to support the collaboration and coordination in distributed software development. Research has shown that these software repositories contain rich amount of information about software projects. By mining the information contained in these software repositories, practitioners can depend less on their experience and more on the historical data [11]. However, software repositories are commonly used only as record-keeping repositories and rarely for design decision processes [10]. Examples of software repositories are [12]: source control repositories, bug repositories, archived communication etc.
Developers 1 use these repositories to interact with each other or to solve software-related problems. By extracting rich information from these repositories, one can guide decision processes in modern software development. For example, source-code and bugs are quite often discussed on bug repositories and project mailing lists. Data in these software repositories could be analyzed to extract bug and source-code related discussions, which could be linked to the actual bug description and source-code. This could allow keeping track of developers discussion related to a bug or source-code in different software repositories.
Developers are required to adopt an identity for each software repository they want to use. For example, they are required to adopt an email address in order to send an email to the project mailing list, adopt an ID 2 to push commits to the source control repository, adopt an ID to report bugs or post comments on a bug in bug repository etc. Often developers adopt different ID(s) for each software repository and sometimes multiple ID(s) for the same repository [21], while interacting with these software repositories in different context. The software repositories are designed to help developers collaborate and coordinate but lacks the feature to manage and merge multiple ID(s) of a developer specially in the case of mailing list and bug tracking repositories. Hence, a developer can register multiple ID(s) and use them to collaborate and coordinate with other developers. For example, Figure 1 shows a subset of the social network graph derived from the communication occurred among developers on the mailing list of an Apache project.
In the figure, we colored two nodes (i.e., red and blue) showing that both email ID(s) belong to a developer X 3 . It is quite clear that the developer X has used both email ID(s) to communicate with other developers on the mailing list. Looking into the social structure of graphs (based on both ID(s)), we see that there are few ID(s) which are common in both graphs but most of the ID(s) are part of different social graphs of developer X. In an ideal scenario, either both social graphs need to be merged into one graph or at-least there should exist some sort of information conveying that both ID(s) belong to developer X. Without any information provided about the different ID(s) used by a developer, a researcher who carries out social network analysis research [9], [19], [18]   developers appear as different entities. Furthermore, analysis of developer's activity [7] within a project might be difficult if the developer is using different and multiple ID(s) for each software repository.
In the specific case of open source software development, developers are often involved in multiple projects. One of the many reasons is that they see their contribution of time and efforts as a gift they provide to the project community. Other reasons could be a developer's ability to acquire skills and experience by participating in multiple open source projects. Hence the development activity of a developer is distributed not only within different software repositories of a project but also across different projects. It is worth mentioning that there exists an implicit connection between developer's activity within different software repositories of a project and also across different projects. Therefore, we require methods and techniques to correctly identify and relate the different ID(s) of developers not only within a project but also across different projects.
The contribution of this paper is manifolds: we propose a simple yet effective Linked Data-based approach to identify and interlink multiple ID(s) of a developer within different software repositories of a project as well as across software repositories of different projects. We show how interlinking helps in integrating developer's development activity which is distributed within different software repositories of a project due to the usage of multiple ID(s). Further, we show how the interlinking enables us to keep track of developer's activity across different projects. Further, we compare the matching results of our approach with other existing developer identification approaches.
The paper is structured as follows: in Section II we review related work and compare it to our approach. Research questions are outlined in section III. In Section IV, we present our approach to identify and interlink multiple ID(s) of a developer. We report our evaluation on 5 Apache projects in Section V followed by the discussion on the results and research questions in Section VI. Finally, we conclude in Section VII and outline future steps.

II. RELATED WORK
To the best of our knowledge, there are only a few published works on identifying and relating the different ID(s) that developers use to interact with software repositories of a project. In [6], Bird et al. proposed an approach to produce a list of <name,email> pairs by parsing the emails and clustering them. The clustering algorithm to measure the similarity between every pair of ID(s) is based on string similarity between names, between emails, between names and emails, etc. Two ID(s) with a similarity measure lying below a pre-defined threshold value are placed into the same cluster. The authors use different approaches to compute the similarity measures between every pair of ID(s), some of which we tested on our data set to validate the effectiveness of these approaches (cf. Section V). We also use the string similarity approach in order to interlink the ID(s) of a developer but our scope is broader in a sense that we are applying the heuristics not only within a single project or a repository but also across repositories of multiple projects.
In [21], Robles et al. discusses the problem of developer identification in general, but the work lacks in details about the heuristics they propose to identify and match the different ID(s) of developers. This makes it difficult to validate their approaches for solving this problem. The authors propose a technique to build one identity from another by extracting the "real life" name from email addresses, such as surname@domain.com, name.surname@domain.com etc. This is an approach based on pre-defined name schemes. Further, they try to match user names obtained from CVS to email addresses (excluding the domain after "@"). Their approach also relies on string similarity algorithm.
In [8], Megan et al. presents a methodology to integrate data about projects by way of matching projects (entities) based on certain heuristics. The heuristics include a simple scoring system for confidence in pairwise project matches. The author propose to interlink similar projects which are listed on multiple code forges. On the contrary, in this paper we focused on identification and interlinking multiple ID(s) of a developer.
In general, the problem is related to duplicate detection. The duplicate detection frameworks provide several techniques to effectively solve matching different entities. In this regard, Kopcke et al. [17] analyzed and compared 11 different duplicate detection frameworks. While research in this area mostly refers to identifying duplicates in the same data set, the techniques might be mapped to the case of matching over different data sets. However, they are tailormade for identifying different ID(s) of the same developer inside one repository. Naumann et al. [20] provides a nice overview of this research direction.
In the Semantic Web domain, Volz et al. [23] proposed an interlinking framework also known as SILK framework which generates link between two RDF data sets based on some string similarity measures specified by the user. Their interlinking framework supports different string similarity algorithms to compare if two different RDF resources are similar or not. In a next step, we will assess to what extent we can use the SILK framework to interlink developer identities in the near future.
III. RESEARCH QUESTIONS In the following, we list down few research questions which will be addressed later (Section VI) based on interlinking multiple ID(s) of a developer: 1) RQ-1: What is the added benefits of interlinking multiple ID(s) of a developer?
We will investigate if developers are using multiple ID(s) within a software repository and if interlinking them will provide any added benefits. We will validate this on the mailing list data set by computing the distinct nodes which are not shared among different social graphs of a developer using multiple ID(s). Further, we will compute the number of bugs reported by a developer using multiple ID(s). This would show us a clear picture on the amount of information which was not previously connected. 2) RQ-2: What is the ratio of developer's existence in multiple projects?
We will investigate the frequency of developer's participation in multiple projects. We will evaluate if developer's participation follow Pareto's law i.e., many developers participate in few projects and only few developers participate in many projects.
3) RQ-3: What are the contributions made by developers in multiple projects? As mentioned in Section I, developers often contribute to multiple open source projects. Therefore, we will evaluate to what extent our interlinking approach helps us in tracing developer's activity across multiple projects. In particular, we will investigate if developers are highly active in pushing source control commits, reporting bugs and participating in discussions across multiple projects.
IV. OUR APPROACH In this section, we describe our approach for identifying and interlinking different ID(s) of a developer found in different software repositories of a project as well as across repositories of multiple projects. We considered mailing list and bug repository data sources in this paper because developers often use multiple ID(s) on these software repositories as opposed to source control repository where access and accounts are controlled centrally.
With "Linked Data Driven Software Development" (LD2SD) [15], we have introduced a Linked Data-based methodology to relate and integrate data across software repositories explicitly and unambiguously. We propose to use Semantic Web technologies to represent data from different software repositories. As such, we propose to use RDF [16] (Resource Description Framework) as the core, target data model. Once modeled in RDF, the data can be easily integrated, indexed and queried using the SPARQL query 4 standard and associated tools. Finally, the integrated data can be published on the Web using Linked Data principles 5 allowing third parties to discover and subsequently crawl the knowledge, and also allowing to interlink with background or other related information available remotely on the Web. We refer the readers to [13] for details on how these standards would be used. Instead here we focus on the identification and interlinking of developer ID(s) in software repositories.
As mentioned by Bird et al., most emails contain header of the form [6]: For each email, we represent header information in RDF using FOAF 6 ontology as shown in Listing 1 (lines 10-12) 7 . We used email alias to build developer URI (i.e., a resource in RDF) because email alias can be used to distinguish between different developers. Further, the name and email address are defined as property values of that resource.
While converting mailing list data to RDF using our custom written scripts, we found certain developers who have used either full or part of their names in the email header. For example: From: "muhammad aftab iqbal" <ai@example.com> The above two email headers reveal multiple name(s) of the sender (i.e., aftab iqbal and muhammad aftab iqbal). The developer URI will remain the same in the case of above two exemplary email headers (i.e., http://srvgal85.deri.ie/linkedfloss/ai) but the property values will be changed (name in that case) After transforming the data into RDF, we loaded the RDF data sets into SPARQL endpoint 8 . The next step is to identify and interlink multiple ID(s) of a developer. From the Listings 1 and 2, we are able to conclude that the developer (aftab iqbal) has used different ID(s) (i.e., ai and aftab.iqbal) while interacting with others on the mailing list and bug tracking repository. We can interlink these two RDF fragments as shown in Listing 3 using an owl:sameAs property indicating that the developer URIs actually refer to the same entity. 8  By explicitly interconnecting ID(s) of a developer, we will be able to trace developer's activity on the mailing list as well as bug tracking repository. However, manually identifying and interlinking ID(s) of a developer is a time consuming task especially when dealing with large amount of data. Hence, one is required to devise an automatic approach for the identification and interlinking of developer ID(s). Therefore, in the following we will explain our simple yet effective developer matching approach.

A. Preliminaries
Let Authors be a list of distinct developer URIs found in the data set: and B = http : //srvgal85.deri.ie/linkedf loss/af tab.iqbal (3) LetÁ andB be the distinct <name,id> pairs for A and B respectively: < af tab iqbal, ai@example.com > < muhammad af tab iqbal, ai@example.com > (4) B = < af tab iqbal, af tab.iqbal@example.com > Let L be a set which contains a collection of distinct developer URI(s) along with their associated <name, id> pairs:

B. Interlinking id(s) within a repository
We compared each developer URI with every other developer URI in set L by comparing their respective <name,id> (where id means excluding everything after "@") pairs. If the match is found then we established an owl:sameAs link between both developer URIs as shown in Listing 3. The pseudocode of our approach is shown in IV.1.

C. Interlinking id(s) across projects
Let Prj A and Prj B (cf. (7) and (8)) be the sets which contains the collection of distinct developer URI(s) along with their associated <name, email> pairs extracted from the repositories of project A and project B such that: For each developer URI, we queried the owl:sameAs links (if any) in order to build a comprehensive list of <name,email> pairs. Later, we compared each developer URI in set Prj A with every developer URI in set Prj B by comparing their respective <name,email> pairs. If the match is found then we established an owl:sameAs link between both developer URIs (cf. Listing 3). The pseudocode of our approach is shown in IV.2. The outcome of our above described matching approach(es) is a set of RDF files in N-TRIPLES 9 format which contains the developer URIs having owl:sameAs links between them (e.g., see Listing 3). The RDF files were later loaded into the SPARQL endpoint so that it can be used to query development activity of a developer within and across projects.
V. EVALUATION Before we discuss the results of our matching approach(es), we describe the Apache projects selected for evaluation. We gathered data from software repositories of 5 Apache projects (c.f Table I). The reason of choosing Apache projects is that the repositories of these projects are on the Web and are available to download (i.e. mailing list archives, bugs etc.). We selected data from the beginning of each Apache project to date as listed in Table I

A. Interlinking id(s) within a repository
We applied our interlinking approach (cf. algorithm IV.1) on the mailing list and bug tracking repositories separately in order to identify and interlink multiple ID(s) of a developer within a repository. During the matching phase, we found certain ID(s) which were more generic and multiple developers in each project were associated to those ID(s). We excluded those ID(s) during the matching phase. Example of those ID(s) are: jakarta, jakarta-ant, lists, ant-dev, dev, ant, apache, general, hadoop, nutch-dev, log4j, log4j-dev, java-dev, lucene etc. The outcome of our interlinking approach is listed in Table II and Table III where the ''links'' column tells the total number of owl:sameAs links established between ID(s) and the ''id(s) found'' column tells the distinct ID(s) found with a match. From Table II and Table III, it is quite obvious that the usage of multiple ID(s) is mostly common on the mailing list than the bug tracking repository except the special case of Apache ANT project. However, it is still important to identify and interlink multiple ID(s) of a developer in every software repository. The same interlinking approach can be used to match developer ID(s) across different repositories (i.e., matching mailing list and bug tracking repository developer ID(s)) of a project.

B. Interlinking id(s) across projects
In order to identify and interlink ID(s) of a developer across different projects, we applied our interlinking approach (cf. algorithm IV.2) to the bug tracking and mailing list repositories of the Apache projects separately. We considered two ID(s) to be similar if either the <name,email> pairs of both ID(s) matched or if only the name or email address of the ID(s) matched. During the evaluation, we found match for certain developers who have used single word name (e.g., chris, brian, J etc.). As it is difficult to differentiate if the developers are same (based on a single word name), we only considered a match if the developer name consists of at-least two words (i.e., name surname). The results of our cross project matching approach based on mailing list and bug tracking repositories are shown in Table IV and Table V. The results shows that matching based on either name or email address generates more links comparing to the <name,email> approach. Also, there are good number of developers whose ID(s) are found in multiple projects. The most prominent among them are the developers of Apache ANT and Apache Maven project. The reasons behind it could be the technical dependencies between both projects (e.g., using project components) or the interest of developers working on similar kind/category of projects.

C. Comparison of developer identification approaches
In this section, we apply few methods suggested so far (see section II) on our data set in order to validate the effectiveness of these methods comparing to our approach. We applied the methods only on the mailing list data set because we found the usage of multiple ID(s) more on the mailing list comparing to the bug repository data set. In order to identify multiple ID(s) of a developer, we computed similarities between developer names and email addresses or between two email addresses using Levenshtein edit distance [22] algorithm, as suggested by Bird et. al in [6]. The match(es) are those developers with the lowest edit distance. Clearly, there is no sense in allowing arbitrary large distance values so we tested the methods by setting three different threshold values: 1) the maximal length of both strings 2) the minimal length of both strings 3) a fixed threshold value of 4 In order to identify the similarity between names and emails (i.e., Name-Email approach), we matched email addresses (excluding everything after "@") and developer names by doing a pairwise comparison. Besides names and emails, it is very likely that the emails are textually similar to each other if they belong to the same developer. Thus, we also computed the pairwise similarities between email addresses (i.e., Email-Email approach) and determined match(es) for one email by choosing those with the lowest edit distance. Later, we validate the match(es) produced by each method and discard the wrong match(es). The results of both methods comparing to our method is shown in Figure 2. The figure shows that our approach identifies more ID(s) comparing to other existing approaches. Although combination of existing approaches with our approach might produce even better matching results but this is not in the current scope of this paper.

VI. DISCUSSION
In this section, we discuss the research questions (cf. Section III) based on the results we achieved through interlinking multiple ID(s) of a developer.

RQ-1: What is the added benefits of interlinking multiple ID(s) of a developer?
In order to evaluate the added benefits of interlinking multiple ID(s) of a developer, we calculated the difference in the social graphs constructed from the mailing list data set and the number of bugs reported by a developer whose multiple ID(s) were not interlinked. We wanted to find out how much of the developer's development activity is not explicitly interconnected due to the usage of multiple ID(s) within a software repository.
We calculated the difference between social graphs of a developer who has multiple ID(s) for each Apache project. The social graphs was constructed by analyzing the reply structure of the email threads. The difference between two social graphs belonging to the same developer was calculated by computing the number of nodes (i.e. id) which are part of one social graph of a developer and is missing in the other social graph of that particular developer. This would reveal distinct nodes which are part of different social graphs of a developer and lacks a common connection (as motivated in Figure 1). We sorted the results and plotted it in the form of a chart as shown in Figure 3.
The charts plotted shows a potential difference in the social graphs of a developer who has multiple ID(s). For example, in Apache Maven project (cf. Figure 3), we found a developer whose multiple social graphs had a difference of 613 nodes. This means that there are 613 nodes which  are not shared among his multiple social graphs. We further computed the number of bugs reported by a developer using multiple ID(s) and plotted the results in the form of chart as shown in Figure 4. We stacked up the bugs reported by a developer using multiple ID(s) to show the total number of bugs he/she reported.
The results depicted in Figure 3 and Figure 4 shows that establishing owl:sameAs links between multiple ID(s) of a developer helps in better integration of data sources from 0  10  20  30  40  50  60  70  80   0  10  20  30  40  50  60  70  bugs reported   developers   Apache Ant   id A  id B  id C   0  10  20  30  40  50  60  70  80   0  1  2  3  4  5  6  7  8  the research (i.e., carrying out social network analysis) as well as the software development point of view (i.e., tracing developer activity in a repository). However, based on the small number of Apache projects we selected for this study, the bugs reporting activity using multiple ID(s) are not significant comparing to the mailing list's social graph differences but we believe that one might find significant differences if it is applied to a large number of open source projects.

RQ-2:
What is the ratio of developer's existence in multiple projects?
Based on the evaluation data from 5 Apache projects, we queried owl:sameAs links of each developer across different Apache projects under consideration. Later, we counted the number of projects for which a developer has an owl:sameAs link and plotted the results as shown in Figure 5. The chart shows that developer's participation in multiple projects follow Pareto's law because of the large number of developer's participation in 2 Apache projects in contrast to participation in 3,4 or 5 Apache projects. For each Apache project, we found a small subset of developers participating in at-least 4 Apache projects. It shows us a great tendency of developer's participation in multiple projects. It further opens up new research challenges in order to access, evaluate and analyze developer's behavior while contributing to multiple projects in a given period of time.
RQ-3: What are the contributions made by developers in multiple projects?
In order to trace developer's activity across multiple projects, we queried and retrieved owl:sameAs links of each developer. Later, we computed the development activity of only those developers who participated in multiple projects. In Figure 6, we have summarized the development activity of few developers who contributed to or participated in at-least 2 Apache projects.  The figure shows that only few developers have source control commit rights on multiple projects while most developers have commit rights on a single project. Further, the results shows that developers are mostly active in exchanging emails and participating in bugs related discussions comparing to bugs reporting while contributing to multiple projects. There is a possibility that the developers contributed to the source-code of multiple projects by submitting source-code patches to the mailing list or bug tracking repository. We opt out this particular study as it requires interlinking sourcecode authorship to the developers which is not in the current scope of this paper.
Based on the results, we believe that if the same methodology is applied to a large number of open source projects then we will see a significant amount of developer's activity across multiple projects.

VII. CONCLUSION
We have motivated and proposed a simple yet effective approach to identify and interlink developer ID(s) within different software repositories of a project as well as across different projects. We have shown that explicitly establishing the interconnection (i.e., Linked Data approach) between multiple ID(s) of a developer helps in integrating, querying and tracking developer activities in software repositories of a single project as well as across multiple projects.
We plan to combine existing developer identification methods with our methodology which might help in improving the interlinking approach, yielding higher quality and quantity links for developer ID(s). We further plan to integrate other potential developer-related open source repositories (i.e., analytical services like Ohloh, code forges etc.) as discussed elsewhere [14]. We also plan to include more Apache projects and establish cross-project interlinking among developer ID(s). Based on that, we will study the social behavior of developers who had contributed to different projects in a given period of time. We believe that this will give new insights into the social and technical aspects of open source software developers.