Interinstitutional Research Team Formation Based on Bibliographic Network Embedding

Future IT Innovation Laboratory, Pohang University of Science and Technology, 77, Cheongam-ro, Nam-gu, Pohang-si 37673, Gyeongsangbuk-do, Republic of Korea
Department of Computer Science and Engineering, Pohang University of Science and Technology, 77, Cheongam-ro, Nam-gu, Pohang-si 37673, Gyeongsangbuk-do, Republic of Korea
Department of Creative IT Engineering, Pohang University of Science and Technology, 77, Cheongam-ro, Nam-gu, Pohang-si 37673, Gyeongsangbuk-do, Republic of Korea


Introduction
Research collaboration is one of the primary factors that affect research performance [1-6]. Existing studies on research team formation have concentrated on synergies between scholars [7-10]. To predict these synergies, analyzing or embedding bibliographic networks has recently been the most popular approach [11,12]. These studies searched for adequate collaboration partners for each scholar by analyzing his/her research history. They supposed that the structures of bibliographic networks reflect the reputations (e.g., the number of citations), research topics (e.g., preferred venues), and even working styles (e.g., sustainability of collaborations) of scholars [1,5,6].
However, this approach does not consider that scholars are not the sole stakeholders of research. As employers and funding sources, research institutes influence the research directions and outcomes of scholars. For interinstitutional research projects, the institutes evaluate team members and counterparts of their joint projects according to their own research interests and purposes, just as individual scholars carefully choose their collaboration partners. For example, POSTECH (Pohang University of Science and Technology) is a research-oriented university. This institute encourages its members to publish research articles with scientific impact. On the other hand, RIST (Research Institute of Industrial Science and Technology) aims to develop practical technology and prefers patents to papers.
Thus, when members of POSTECH look for collaborators, they may prefer scholars who have published many high-impact papers in other research-oriented institutes. However, if scholars in POSTECH want to commercialize their research outcomes, scholars in RIST can be suitable collaboration partners. The individual expertise and interests of the institutes can be discovered from bibliographic networks. Scholars in POSTECH will focus on papers rather than patents, whereas scholars in RIST might do the opposite. Also, as a university, POSTECH covers much broader research areas than RIST. Thus, contributions from POSTECH will be published in a wider variety of venues than those from RIST. A comparison of POSTECH with RIST shows differences caused by the types of research institutes. However, within the same type, research institutes have individual characteristics according to their research interests. Figure 1 shows topic distributions of papers published by scholars in three major research-oriented universities in Korea. Although the three institutes share common research topics, their priorities for the topics are different. Also, the priorities can be correlated with the infrastructure for each research field. In team formation for a project, we should match the project's research fields with the participating institutes' expertise.
To conduct the research team formation, we should consider both characteristics of each of the scholars and their affiliation.
There can be scholars who prefer intrainstitutional collaborations or are not familiar with collaborations. Scholars can also prefer particular types of institutes as collaboration partners (e.g., companies or universities). We can extract collaboration styles of both stakeholders (scholars and institutes) from the bibliographic networks. First, affiliations of the collaborators of each scholar reveal what kinds of institutes are preferred by the scholar as collaboration partners. Second, venues of publications written by the scholars show their research interests. Finally, structures of the bibliographic networks represent more detailed research styles of the scholars, such as whether they focus on a few high-quality papers or write prolifically [5,6]. Also, the structures can reveal working styles of research groups; for example, all group members focus on a research topic, the group leader manages multiple independent projects, or plural middle managers lead individual projects [1].
Therefore, in this study, we propose a method for forming research teams that can consider both collaboration styles of individual scholars and aims of research institutes by embedding bibliographic networks. First, this study suggests the interinstitutional collaboration network (Figure 2). This network includes information for research institutes and projects funded by the institutes, which are barely dealt with by the existing studies. Then, we apply the substructure-based graph embedding methods [1, 13-16] [...] Based on the assumptions, we propose three approaches for the interinstitutional team formation: (i) considering every collaboration, (ii) focusing on collaborations between target institutes (based on RQ 1 and RQ 2), and (iii) focusing on collaboration outcomes preferred by the target institutes (based on RQ 1, RQ 2, and RQ 3). The three approaches were evaluated based on research outcomes of POSTECH and RIST from 2011 to 2020.
By comparing (i) with the other two approaches, we can validate RQ 1. A comparison of (i) with (ii) can verify RQ 2. Finally, RQ 3 can be validated by comparing (iii) with the others and examining performances of the proposed methods for different types of publications (e.g., papers and patents). Contributions of this study can be categorized as follows:
(i) Modeling and embedding the interinstitutional collaboration network: This study proposes a novel bibliographic model representing interinstitutional collaborations and a model for embedding the proposed network. Finally, we propose three approaches for predicting collaboration probabilities by using the embedding vectors.
(ii) Discovering features of the interinstitutional team formation: The three approaches for team formation are based on individual features. The first one focuses on collaboration styles of each scholar. The second and third approaches consider collaboration styles of both research institutes and scholars and research interests of the institutes, respectively. Thus, experimental results for the approaches can exhibit the significance of these features for the interinstitutional team formation.
(iii) Validating distinctiveness of interinstitutional collaborations: The comparisons between the three approaches also validate the fundamental assumptions of this study. The validation confirms that we need specialized methods for composing interinstitutional research teams. Our findings can also be applied to other bibliography analysis tasks, such as predicting research institutes' performances and matching employers (institutes) and employees (scholars).
The remainder of this paper is organized as follows. Section 2 introduces the existing studies for the research team formation. In Section 3, we introduce the interinstitutional collaboration network, and we propose methods for embedding the network and for composing interinstitutional research teams.
Section 4 explains experimental procedures for evaluating the proposed methods and validates their effectiveness based on the experimental results. Section 5 presents concluding remarks and future research directions.

Mobile Information Systems

Related Work
There have been no studies on forming interinstitutional research teams. Although Purwitasari et al. [17] have proposed a team formation method for interdepartmental research collaboration, this method considers only topics of publications and does not consider departments/institutes as one of the stakeholders of research. Hernandez-Gress et al. [18] analyzed bibliographic data to recommend collaborations between universities by using only research topics of each scholar. Additionally, Guerrero-Sosa et al. [19] analyzed internal and external research collaborations of Universidad Autónoma de Yucatán, but their analysis results were limited to statistics of the data. Therefore, this study validates whether research institutes are significant stakeholders of research and proposes team formation methods that can consider the interests of both institutes and scholars. Beyond interinstitutional research, there have been numerous studies on recommending research collaborators. Most of the existing studies applied link prediction techniques on bibliographic networks. They extracted various features from research publications or bibliographic networks to search for scholars who can potentially (or sustainably [20,21]) collaborate. Structures of bibliographic networks provide various information for bibliographic entities (e.g., scholars, publications, and venues) [1,5,6]. Regarding scholars, coauthorship relations show which types of collaborators are preferred by each scholar [1]. Temporal changes in coauthorship relations also reveal the sustainability of collaborations [6]. By analyzing structures of citation networks, we can extract publications' scientific impact and topical relevancy between publications [20]. Even without citations, relations between scholars and venues partially represent research topics of scholars [5].
Therefore, various studies [8-10, 12, 20-24] attempted to extract structural features of the bibliographic networks and to apply them to predicting future coauthorship. To deal with the structural features, affinity propagation based on random walks was the most popular approach [9,10,20,25]. Recently, however, network embedding models have enabled us to represent the structural features by using low-dimensional fixed-length vectors [1,5,6,12,26]. Owing to the vector representations, we can use conventional machine learning techniques to predict the collaboration probability without much modification.
There have been mainly two kinds of embedding models: proximity-based and structure-based models. If we employ proximity-based models [26], the obtained vector representations will have high similarity for scholars in the same community. However, scholars can already find collaborator candidates within their own circle of acquaintances by themselves. Also, some scholars prefer collaborators who come from diverse research groups [1]. Therefore, for the practicality of team formation methods, we have to provide unexpected collaborator candidates that are similar to previous collaborators of users.
This study employs a structure-based network embedding model and modifies it so that it can be applied to the proposed bibliographic network.
Although bibliographic network structures reflect research topics of publications and scholars, they can hardly be as accurate as analyzing the publications' content. Therefore, various studies applied topic modeling [7] and word/document embedding [9,26] techniques to textual data in publications with an assumption that scholars who deal with similar research fields can collaborate [7-10, 12, 26, 27]. Obviously, information for research topics is valuable for team formation. If we match two scholars in irrelevant domains, they will find it difficult to collaborate, however talented they are. Nevertheless, this assumption cannot deal with forming interdisciplinary research teams, despite their significance for pioneering new research areas and providing practical experiences to scholars [28,29]. We can also analyze probabilities of interdisciplinary research by combining the research topic information with bibliographic network structures. However, analyzing the content of academic publications is beyond the scope of this study. Our further research will attempt to cover the combination of the two kinds of information.
Additionally, a few studies used statistical features extracted from bibliographic data. Bibliometrics (e.g., h-index) are effective for representing the performance of scholars (and other kinds of bibliographic entities) with a single value [21,27]. However, each of the bibliometrics reflects only fragmentary aspects of research. When one scholar wrote a few high-impact publications, another scholar published numerous intermediate publications, and they have the same h-index, it is difficult to say which scholar has performed better than the other. Moreover, a few existing studies validated that network embedding models can reflect features represented by the bibliometrics [1,5,6]. Also, career ages of scholars were used in several existing studies [10,21,27,30]. Nevertheless, this information is already included in bibliographic networks, and we do not always require collaborators who have career ages similar to ours.
In summary, the existing methods have mainly two limitations. First, the existing studies suppose that scholars are the only stakeholder of research. However, as discussed in Section 1, research institutes have their own research interests and purposes. Also, scholars are influenced by the interest and purposes, as employees of the institutes. Second, sharing research topics or being active in the same research communities is not always good for research collaborations. To conduct research, which is a cooperative task, we need team members who can serve individual parts.
Thus, a method that can consider both scholars' diverse roles and research institutes' purposes is required.

Interinstitutional Research Collaboration Prediction
This study aims at composing interinstitutional research teams by considering characteristics of both research institutes and their members. We have improved the conventional methods in terms of the following three points: (i) The proposed bibliographic network model covers information for research institutes and projects. (ii) Substructure-based graph embedding methods enable us to reveal research interests and expertise of institutes/scholars. (iii) We propose three approaches for learning the collaboration history of target institutes. The approaches are evaluated and compared with each other in Section 4.

Interinstitutional Collaboration Network.
Most of the existing studies use only coauthorship relations for analyzing/predicting collaborations. However, coauthorship alone makes it difficult to discover characteristics of scholars and institutes in collaborations, such as research interests, roles in research groups, and expertise. Therefore, we extend the conventional bibliographic network, which consists of scholars, publications, and venues, to cover research institutes and projects. The proposed network model is defined as follows.
This study defines the bibliographic network as a heterogeneous network, which has multiple kinds of nodes and relations. The bibliographic network ($\mathcal{N}$) contains five kinds of nodes: scholars ($\mathcal{A}$), publications ($\mathcal{P}$), venues ($\mathcal{V}$), institutes ($\mathcal{I}$), and projects ($\mathcal{F}$). Between these nodes, there are five kinds of relations: a scholar "writes" an academic publication ($W \in \mathbb{R}^{|\mathcal{A}| \times |\mathcal{P}|}$), an academic publication "is published in" a venue ($P \in \mathbb{R}^{|\mathcal{P}| \times |\mathcal{V}|}$), a scholar "is affiliated with" an institute ($A \in \mathbb{R}^{|\mathcal{I}| \times |\mathcal{A}|}$), a scholar can "participate in" a project ($M \in \mathbb{R}^{|\mathcal{A}| \times |\mathcal{F}|}$), and an academic publication can "be a result of" a project ($R \in \mathbb{R}^{|\mathcal{F}| \times |\mathcal{P}|}$).
This can be formulated as
$$\mathcal{N} = \langle \mathcal{A}, \mathcal{P}, \mathcal{V}, \mathcal{I}, \mathcal{F}, W, P, A, M, R \rangle.$$
Edges in the network represent only the existence of the relations, and the edges connect only heterogeneous nodes, so it is not necessary to annotate edge directions. Thus, the interinstitutional collaboration network is undirected and unweighted. Figure 2 illustrates an example of the bibliographic network.
As shown in Figure 1, we can reveal characteristics of research institutes, such as their preferences for research fields, using only publication records. However, this information does not include collaboration styles of the institutes. Also, we assume that scholars' choices of collaborators differ according to their own and their collaborators' affiliations. This point can be revealed by a metapath, I-A-P-A-I. This metapath represents preferences of research institutes for partner institutes. Research interests and aims of the institutes will also be reflected by I-A-P-V. Project nodes enable us to know whether joint projects between target institutes have been successful or not (I-A-F(-P)-A-I). We can also analyze the sustainability of interinstitutional teams after the joint projects are finished.
The sustainable teams will be benchmarks for composing productive research teams.
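The network model and metapaths described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in this study: the class `BibNetwork`, the method `metapath_ends`, the encoding of nodes as `(type, name)` pairs, and the toy node names are all hypothetical.

```python
from collections import defaultdict


class BibNetwork:
    """Undirected, unweighted heterogeneous network (hypothetical sketch).

    Nodes are (type, name) pairs, where type is one of
    "A" (scholar), "P" (publication), "V" (venue), "I" (institute), "F" (project).
    """

    def __init__(self):
        self.adj = defaultdict(set)

    def add_edge(self, u, v):
        # Edges only record that a relation exists and only join different node types.
        if u[0] == v[0]:
            raise ValueError("edges must connect heterogeneous nodes")
        self.adj[u].add(v)
        self.adj[v].add(u)

    def metapath_ends(self, start, metapath):
        """Follow a metapath such as "IAPAI" from `start`; return the reachable end nodes."""
        if start[0] != metapath[0]:
            raise ValueError("start node type must match the head of the metapath")
        frontier = {start}
        for node_type in metapath[1:]:
            frontier = {n for u in frontier
                        for n in self.adj.get(u, ())
                        if n[0] == node_type}
        return frontier


# Toy data: two POSTECH scholars, one RIST scholar, two publications, one venue.
net = BibNetwork()
net.add_edge(("I", "POSTECH"), ("A", "a1"))
net.add_edge(("I", "POSTECH"), ("A", "a3"))
net.add_edge(("I", "RIST"), ("A", "a2"))
net.add_edge(("A", "a1"), ("P", "p1"))
net.add_edge(("A", "a2"), ("P", "p1"))
net.add_edge(("A", "a3"), ("P", "p2"))
net.add_edge(("P", "p1"), ("V", "v1"))

# I-A-P-A-I: institutes reachable through coauthored publications.
partners = net.metapath_ends(("I", "POSTECH"), "IAPAI")
# I-A-P-V: venues in which the institute's members publish.
venues = net.metapath_ends(("I", "POSTECH"), "IAPV")
```

Note that a metapath walk can return to its starting institute, so `partners` contains POSTECH itself as well as RIST; a real analysis would probably filter out the start node and count path multiplicities.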

Bibliographic Network Embedding.
Adjacency-based graph embedding methods (e.g., LINE [31]) can be effective for revealing preferences of scholars and research institutes. If $a_i$ and $a_j$ collaborate frequently and $a_j$ wrote a number of publications with $a_k$, these methods will assign close vector representations to the three scholars. Then, $a_k$ will be one of the collaborator candidates of $a_i$ with high priority. When all three scholars have similar roles in their collaboration, this recommendation is reasonable. However, scholars with the same expertise will not have much motivation for collaboration. If $a_j$ has been advising $a_i$ and $a_k$ as a domain expert, $a_i$ and $a_k$ will not have much reason to work with each other.
Our previous study [1] showed that substructure-based graph embedding methods can resolve this issue. These methods assign similar vector representations to nodes that have similar substructures. In the above example, if $a_j$ prefers applying his/her own expertise to various domains, substructures rooted in $a_j$ will have the star topology. The various domains will also be revealed by the diversity of scholars and venues connected with $a_j$. In contrast, $a_i$ and $a_k$ will be connected with less diverse venues than $a_j$.
This point also holds for discovering characteristics of research institutes and projects. Universities will have connections with a wider variety of venues than nonuniversity research institutes, which mostly have particular research fields. Also, participants of pure research projects will be members of universities rather than of companies. On the other hand, both universities and companies will participate in projects for technology commercialization. Therefore, this study applies Subgraph2Vec [13], which aims at embedding subgraphs rooted in each node, to the bibliographic network. Subgraph2Vec consists of the WL (Weisfeiler-Lehman) relabeling process [32] and Word2Vec [33]. This model assigns close vectors to subgraphs rooted in the same (or adjacent) nodes.
First, WL relabeling is a method for describing subgraphs rooted in each node exactly. This method iteratively assigns new labels to each node by using the labels of the node itself and its adjacent nodes. For example, $a_i$ in Figure 2 has A, which is its node type, as an initial label. At the first iteration, we check the labels of the neighbors of $a_i$, for example, I of $i_x$, F of $f_\alpha$, and P of $p_a$. Then, $a_i$ gets a new label, A: I, F, P. By iterating this process, the scales of subgraphs represented by the labels become wider. To observe network structures with multiple scales, we call labels generated at the $d$-th iteration "subgraphs on degree $d$" and describe substructures rooted in a node as a set of the subgraphs. In practice, we sort the labels of the neighbors and apply a hash function to the new label to avoid making redundant labels. Algorithm 1 presents procedures of the WL relabeling on our bibliographic network model, where $a_i^{(d)}$ indicates the subgraph rooted in $a_i$ on degree $d$, $S$ denotes a subgraph dictionary, and $D$ refers to the maximum degree.
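The relabeling procedure described above can be sketched as follows. This is a minimal illustration, not a reproduction of Algorithm 1: the function name `wl_relabel`, the dictionary-based graph encoding, the use of truncated SHA-1 digests as the hash function, and the toy graph are assumptions made here for the example.

```python
import hashlib


def wl_relabel(adj, node_type, max_degree):
    """Return {node: [label on degree 0, ..., label on degree max_degree]}.

    adj:        {node: iterable of adjacent nodes} (undirected, so symmetric)
    node_type:  {node: type letter such as "A", "P", "V", "I", "F"}
    """
    labels = {n: node_type[n] for n in adj}        # degree 0: the node type
    history = {n: [labels[n]] for n in adj}
    for _ in range(max_degree):
        new_labels = {}
        for n, neighbours in adj.items():
            # Sort neighbour labels so the result does not depend on edge order.
            signature = labels[n] + ":" + ",".join(sorted(labels[m] for m in neighbours))
            # Hash the signature to keep labels short (SHA-1 is an arbitrary choice).
            new_labels[n] = hashlib.sha1(signature.encode()).hexdigest()[:12]
        labels = new_labels
        for n in adj:
            history[n].append(labels[n])
    return history


# Toy graph: two scholars in one institute, each with one publication.
example_adj = {
    "a1": ["i1", "p1"],
    "a2": ["i1", "p2"],
    "i1": ["a1", "a2"],
    "p1": ["a1"],
    "p2": ["a2"],
}
example_type = {"a1": "A", "a2": "A", "i1": "I", "p1": "P", "p2": "P"}
history = wl_relabel(example_adj, example_type, max_degree=2)
```

In this toy graph, `a1` and `a2` sit in structurally identical positions, so they receive the same label on every degree, which is the property that lets substructure-based embedding treat them as similar.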
To apply Word2Vec to subgraphs, we have to define the ranges of their neighborhoods. In texts, sentences are sequences of words, and neighboring words can easily be extracted using sliding windows. However, nodes in networks are not sequential. Therefore, we define neighborhoods based on the adjacency of nodes and on degrees, as in the previous study [1]. The neighborhoods of $a_i^{(d)}$ are formulated with $W_D$, a window size for the degree. Neighborhoods for the other node types are composed in the same way.
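One possible reading of this neighborhood definition is sketched below. It is an assumption, not the formula of this study: here the neighborhood of a subgraph rooted in a node on degree `d` is taken to be the subgraphs rooted in the node itself and in its adjacent nodes whose degrees lie within `window` of `d`. The function name and the hand-written label table are hypothetical.

```python
def subgraph_neighbourhood(adj, history, node, d, window):
    """Collect context subgraphs for the subgraph rooted in `node` on degree `d`.

    Assumed rule: take subgraphs of the node and its adjacent nodes on every
    degree delta with |d - delta| <= window, excluding the target subgraph itself.
    """
    lowest = max(0, d - window)
    highest = min(len(history[node]) - 1, d + window)
    context = []
    for m in [node] + sorted(adj[node]):
        for delta in range(lowest, highest + 1):
            if m == node and delta == d:
                continue  # skip the target subgraph
            context.append(history[m][delta])
    return context


# Hand-written labels on degrees 0..2 stand in for the output of WL relabeling.
toy_adj = {"a1": ["i1", "p1"], "i1": ["a1"], "p1": ["a1"]}
toy_history = {
    "a1": ["A", "A1", "A2"],
    "i1": ["I", "I1", "I2"],
    "p1": ["P", "P1", "P2"],
}
wide = subgraph_neighbourhood(toy_adj, toy_history, "a1", d=1, window=1)
narrow = subgraph_neighbourhood(toy_adj, toy_history, "a1", d=1, window=0)
```

With `window=0` only same-degree subgraphs of adjacent nodes are used as context, while a larger window mixes scales; the (target, context) pairs produced this way would play the role of (word, neighboring word) pairs in Word2Vec.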

To embed the subgraphs, we use the SkipGram model with negative sampling [33]. In its objective function, $P_n(S)$ denotes a noise distribution of subgraphs, $k$ indicates the number of negative samples, and $\Phi(\cdot)$ denotes the projection function. In this study, [...]
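For reference, the standard SkipGram negative-sampling objective from Word2Vec [33] can be written with the symbols defined above, for one target subgraph $s$ and one of its neighboring subgraphs $s_c$. The symbols $s$, $s_c$, $s_j$, and the logistic function $\sigma$ are notational assumptions introduced here, and using a single projection $\Phi$ for both target and context is a simplification; the exact form used in this study may differ.

```latex
\log \sigma\!\left( \Phi(s_c)^{\top} \Phi(s) \right)
+ \sum_{j=1}^{k} \mathbb{E}_{s_j \sim P_n(S)}
  \left[ \log \sigma\!\left( -\Phi(s_j)^{\top} \Phi(s) \right) \right],
\qquad \sigma(z) = \frac{1}{1 + e^{-z}}
```

Maximizing the first term pulls a subgraph toward its observed neighbors, while the second term pushes it away from $k$ subgraphs drawn from the noise distribution.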

Research Collaboration Prediction.
We use the conventional MLP (Multilayer Perceptron) model to predict interinstitutional collaborations. The MLP model consists of three fully connected layers and one dropout layer. Inputs of the model are $2 \times \delta$-dimensional vectors composed by concatenating the vector representations of two scholars. The activation function of the model's output layer is the sigmoid function, and the other layers use the ReLU (Rectified Linear Unit) function as their activation functions. This model predicts collaboration probabilities between two scholars, and scholar pairs are classified into two groups: those that are appropriate for collaboration and those that are not. As a loss function, the binary cross entropy is applied.
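The forward pass and loss of such a model can be sketched in plain Python as follows. This is a minimal illustration under assumptions: the function names are hypothetical, the weights are randomly initialized rather than trained, a single sigmoid output unit is assumed to follow the hidden layers, and the dropout layer is omitted because it is only active during training.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def dense(x, weights, biases):
    # One fully connected layer; `weights` has one row per output unit.
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, biases)]


def build_mlp(delta, hidden=(200, 150, 80), seed=0):
    """Randomly initialise layers for a 2*delta-dimensional input (assumed layout)."""
    rng = random.Random(seed)
    sizes = [2 * delta, *hidden, 1]
    layers = []
    for n_in, n_out in zip(sizes, sizes[1:]):
        scale = 1.0 / math.sqrt(n_in)
        weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
                   for _ in range(n_out)]
        layers.append((weights, [0.0] * n_out))
    return layers


def predict(layers, vec_a, vec_b):
    """Collaboration probability for two scholars' embedding vectors."""
    x = list(vec_a) + list(vec_b)              # concatenate the two vectors
    for weights, biases in layers[:-1]:
        x = [max(0.0, z) for z in dense(x, weights, biases)]   # ReLU
    weights, biases = layers[-1]
    return sigmoid(dense(x, weights, biases)[0])


def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    p = min(max(y_prob, eps), 1.0 - eps)       # clip to avoid log(0)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))


# Small sizes keep the example fast; they are not the settings of this study.
mlp = build_mlp(delta=4, hidden=(8, 6, 4))
prob = predict(mlp, [0.1, 0.2, 0.3, 0.4], [0.4, 0.3, 0.2, 0.1])
loss = binary_cross_entropy(1, prob)
```

Because the input is a plain concatenation, `predict(mlp, a, b)` and `predict(mlp, b, a)` generally differ; a symmetric prediction would need both orders in the training data or a symmetric combination of the two vectors.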
In this study, we focus on interinstitutional collaborations, for which we should consider not only relationships between individual scholars but also relationships between research institutes and between scholars and institutes. Research institutes have their own purposes, and members of the institutes should also concentrate on their occupational research. Therefore, we cannot ensure that training the model to predict every collaboration (scholar-publication-scholar relations) in the bibliographic network is the best approach for learning the individual characteristics of research institutes. Hence, we propose two more approaches based on our research questions (in Section 1) to make the model reflect the agendas of the target institutes and compare them with the conventional approach (i.e., learning all the previous collaborations). The three approaches for training the MLP model are as follows.
The first case supposes that the bibliographic network embedding method can represent characteristics of research institutes and their collaborations despite their diversity. Thus, this case assumes that scholars' vector representations include information for purposes and preferences of the scholars' affiliations. In this case, the MLP learns all the collaborations in the bibliographic network, as shown in Figure 3(b), and we use the trained model to predict probabilities of further collaborations between scholars from target institutes. Therefore, this approach makes the prediction model reflect the general characteristics of research collaborations. Although the general characteristics cover interinstitutional collaborations, they will not be as clear as those obtained by focusing only on interinstitutional collaborations. Thus, we use this approach as a baseline for validating whether interinstitutional collaborations have distinctive characteristics compared to the others (RQ 1).
The second case, which is based on RQ 1 and RQ 2, focuses on searching for scholars that are appropriate for collaborations between the target institutes. There will be scholars who prefer collaborations but only intrainstitutional ones or only those with particular partner institutes. If a scholar has preferences according to reputations or types of institutes, our embedding model can extract the information from publications and venues connected with the institutes. The institutes will also be concerned with whether the scholar can conduct the research that they expect. For example, POSTECH and RIST are significant research partners of each other. However, not all the scholars in the two institutes participated in collaborative studies between the institutes. Thus, we can assume that there is a certain type of scholar that is appropriate for the mutual interests of the institutes. Therefore, this case uses bibliographic networks that consist of scholars in the target institutes as a dataset. Then, we train the MLP to predict whether a group of scholars from the respective institutes has previous collaborations, as shown in Figure 3(c). By comparing this case with the first one, we can reveal whether research institutes' characteristics affect their employees (RQ 2).
We have designed the third approach based on all the research questions (RQ 1, RQ 2, and RQ 3). This case especially concentrates on the fact that research institutes have individual agendas and preferable kinds of publications (RQ 3). Thus, we first find academic publications that are similar to outcomes of previous collaborations between the target institutes by clustering publications in our bibliographic network according to their vector representations. Then, we search for scholars who have written publications that are in the same clusters as the previous collaboration outcomes. We assume that these scholars are capable of conducting the research that the target institutes expect from their collaborations. When publications that come from collaborations between POSTECH and RIST are in cluster A, research groups that wrote publications in cluster A will let us know compositions of research groups that are appropriate for collaborations between the two institutes. Thus, in this approach, the MLP model learns only the research groups which produced research outcomes that are similar to the previous collaboration outcomes of the target institutes, as shown in Figure 3(d). By comparing this approach with the others, we can validate whether research institutes have preferences for types or topics of publications (RQ 3).
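The difference between the three approaches lies in which scholar pairs the MLP learns from, which can be sketched as follows. This is an illustration under assumptions: the function names, the record layout, and the institute names other than POSTECH and RIST are hypothetical, the cluster id of each publication is assumed to be precomputed from its vector representation (the clustering algorithm is not specified here), and reading Case 2 as "pairs with one scholar from each target institute" is an interpretation of the description above.

```python
from itertools import combinations


def coauthor_pairs(pubs):
    """All unordered coauthor pairs over a list of publications."""
    return {tuple(sorted(pair))
            for pub in pubs
            for pair in combinations(pub["authors"], 2)}


def is_joint(pub, affiliation, inst_x, inst_y):
    """True if the publication has authors from both target institutes."""
    institutes = {affiliation[a] for a in pub["authors"]}
    return inst_x in institutes and inst_y in institutes


def case1_pairs(pubs):
    # Case 1: learn every collaboration in the network.
    return coauthor_pairs(pubs)


def case2_pairs(pubs, affiliation, inst_x, inst_y):
    # Case 2: learn only collaborations between scholars of the two target institutes.
    return {(a, b) for a, b in coauthor_pairs(pubs)
            if {affiliation[a], affiliation[b]} == {inst_x, inst_y}}


def case3_pairs(pubs, affiliation, inst_x, inst_y):
    # Case 3: learn research groups whose publications fall in the same clusters
    # as the previous joint outcomes of the target institutes.
    target_clusters = {pub["cluster"] for pub in pubs
                       if is_joint(pub, affiliation, inst_x, inst_y)}
    similar = [pub for pub in pubs if pub["cluster"] in target_clusters]
    return coauthor_pairs(similar)


affiliation = {"a1": "POSTECH", "a2": "RIST", "a3": "OtherUniv",
               "a4": "OtherLab", "a5": "POSTECH"}
pubs = [
    {"authors": ["a1", "a2"], "cluster": 0},   # POSTECH-RIST joint outcome
    {"authors": ["a3", "a4"], "cluster": 0},   # other institutes, similar outcome
    {"authors": ["a1", "a5"], "cluster": 1},   # intrainstitutional collaboration
]
pairs1 = case1_pairs(pubs)
pairs2 = case2_pairs(pubs, affiliation, "POSTECH", "RIST")
pairs3 = case3_pairs(pubs, affiliation, "POSTECH", "RIST")
```

In this toy data, Case 3 keeps the pair `("a3", "a4")` even though neither scholar belongs to a target institute, because their publication resembles a previous joint outcome; this is what lets the third approach learn from more samples than the second.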

Evaluation
To evaluate the proposed methods, we predicted interinstitutional collaborations by analyzing previous collaboration history. Also, our research questions were validated by comparing the performances of the proposed methods with each other. We supposed that research institutes have preferences for topics and types of their members' research outcomes (RQ 2 and RQ 3).
Thus, we should collect multiple types of academic publications, although the existing studies mostly dealt with one type. The multiple types caused a limitation in our experiments. Unlike papers, which are covered by numerous well-organized academic databases (e.g., DBLP and Scopus), accurate publication records for patents or technical reports published by each research institute are not easy to obtain. Thus, we collected the paper dataset from the open academic databases and acquired a patent dataset by directly requesting it from the research institutes. For this reason, we could not conduct the experiments on a large-scale dataset covering multiple research institutes. Nevertheless, publication records of research institutes include their collaborating institutes. Thus, the proposed methods made predictions by analyzing the characteristics of hundreds of research institutes, although they predict collaborations between only a few institutes.
We collected papers and patents published by scholars in POSTECH and RIST from January 2011 to September 2020. The papers were gathered through the affiliation profile pages on Scopus, and RIST provided bibliographic data for the patents. Our bibliographic network consists of the papers, patents, and every scholar/institute/venue connected with the papers and patents. We composed the network for two time periods: 2011-2015 and 2016-2020. The proposed methods were trained on the bibliographic data from 2011 to 2015 and validated based on the collaborations from 2016 to 2020. In our dataset, papers' author names are in English, and patents' inventor names are in Korean. Thus, we could not build a unified network for both types of publications. We constructed two separate networks and compared the performances of the proposed approaches on the two networks to validate whether research institutes (and their members) have distinct characteristics. Table 1 presents statistics of the bibliographic networks. The three approaches proposed in Section 3.3 were evaluated based on accuracy for predicting collaboration outcomes between POSTECH and RIST. The accuracy was assessed using three metrics: precision, recall, and $F_1$ measure. When we measure the accuracy of predicting collaborations between $i_x$ and $i_y$, these metrics are calculated as
$$p(i_x, i_y) = \frac{|\hat{C}(i_x, i_y) \cap C(i_x, i_y)|}{|\hat{C}(i_x, i_y)|}, \quad r(i_x, i_y) = \frac{|\hat{C}(i_x, i_y) \cap C(i_x, i_y)|}{|C(i_x, i_y)|}, \quad F_1(i_x, i_y) = \frac{2 \, p(i_x, i_y) \, r(i_x, i_y)}{p(i_x, i_y) + r(i_x, i_y)},$$
where $\hat{C}(\cdot, \cdot)$ and $C(\cdot, \cdot)$ are sets of predicted and actual collaborations between two institutes, respectively, and $p(\cdot, \cdot)$, $r(\cdot, \cdot)$, and $F_1(\cdot, \cdot)$ indicate precision, recall, and $F_1$ measure for predicting collaborations between two institutes, respectively. We compared the performances of the proposed approaches with the performance of a baseline method and also with each other. As the baseline (Case All), we use the model of Case 1, one of the proposed approaches, to predict all the collaborations. A comparison of this case with the proposed approaches exhibits the necessity of methods specialized in predicting interinstitutional collaborations. Table 2 presents the experimental results. Additionally, we heuristically tuned hyperparameters of the proposed methods. The number of dimensions for subgraph vectors was 100, and the maximum degree was 4. The MLP model for predicting collaborations includes three fully connected layers that have 200, 150, and 80 nodes. The dropout rate of its dropout layer was 0.2. Also, the number of epochs and the learning rate were set to 50 and 0.0008, respectively.
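The three metrics can be computed from sets of predicted and actual collaborations as in the sketch below. The function name, the representation of a collaboration as a pair of scholar identifiers, and the toy data are assumptions for illustration; returning 0.0 when a denominator is empty is also a convention chosen here.

```python
def precision_recall_f1(predicted, actual):
    """Precision, recall, and F1 for sets of predicted and actual collaborations."""
    predicted, actual = set(predicted), set(actual)
    hits = len(predicted & actual)                    # correctly predicted collaborations
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(actual) if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy example: 3 predicted pairs, 4 actual pairs, 2 in common.
predicted = {("a1", "b1"), ("a2", "b2"), ("a3", "b3")}
actual = {("a1", "b1"), ("a2", "b2"), ("a4", "b4"), ("a5", "b5")}
p, r, f1 = precision_recall_f1(predicted, actual)
```

Here two of the three predictions are correct and two of the four actual collaborations are found, so precision exceeds recall, the same pattern reported for several cases on papers.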

RQ 1: Distinct Characteristics of Interinstitutional Research Collaboration.
The motivation of this study is that we need a collaboration prediction method specialized in interinstitutional collaborations. This necessity can be validated by comparing the performance of Case 1 with the performance of Case All. These two cases present accuracies [...] Case All is for every collaboration. Therefore, the result that Case All had higher accuracy than Case 1 also supports RQ 1. The performance decrements between Case 1 and Case All were similar in predicting collaborated papers and patents. However, both Case 1 and Case All achieved higher accuracy on papers than on patents. In contrast, Case 2 and Case 3, which focus on the previous collaborations, achieved higher accuracy on patents than on papers. We can assume that patents are influenced more by the characteristics of interinstitutional collaborations than papers are, although we should also consider that scholars in RIST barely write papers (62 papers from 2011 to 2020). To examine this point, we should experiment again with a larger dataset containing more institutes and publication types in further research.
Different degrees of diversity among publication types can also cause this result; papers are more diverse than patents. Precision and recall of Case 1 were similar to each other in predicting collaborated patents between POSTECH and RIST. However, its precision for predicting papers coauthored by the two institutes was much higher than its recall. This problem was worse in Case 2, which learns only the previous collaborated papers. In contrast, Case All and Case 3 showed small deviations between their precision and recall on both patents and papers. These two cases might be less affected by the diversity of publications. Since Case All learned general characteristics of the research collaboration, this case gains the capability of handling the diversity. On the other hand, Case 3 searched for scholars who can produce the same kinds of research outcomes as the previous collaborations between POSTECH and RIST. Thus, this case learned both the diversity and the two institutes' characteristics. In conclusion, both types of research publications were affected by the distinct characteristics of interinstitutional research. However, due to the diversity of papers, we need more samples and better methods for extracting features of papers produced by interinstitutional teams.

RQ 2: Correlations of Research Collaborations with Affiliations.
Case 2 learns only interinstitutional collaborations between target institutes, while Case 1 is based on all the collaborations. Also, Case 2 emphasizes scholars who participated in the collaborations, compared to Case 3, which focuses on publications. Case 2 could not outperform Case 1 in predicting papers coauthored by POSTECH and RIST. However, this result might come from RIST's lack of interest in writing papers; the next section provides detailed discussions. In contrast, Case 2 exhibited the best performance in predicting collaborated patents of the two institutes. Its accuracy nearly caught up with the accuracy of Case All (prediction for every collaboration). Additionally, Case 2 exhibited a reasonable precision (0.77) for predicting the collaborated papers, despite its low recall. Case 2 could extract characteristics of previous collaborations, but the characteristics did not have enough generality due to the lack of samples. These results support the view that focusing on previous collaborations between the institutes is more effective than learning the general characteristics of the research collaboration.
RQ 2 concerns the assumption that characteristics of research institutes influence the collaborations of their members. By comparing Case 1 with Case All, we found that interinstitutional collaborations between particular institutes have unique characteristics compared to the other collaborations conducted by the institutes. Then, Case 2 revealed that we could find and utilize these unique characteristics. Also, the differences in accuracy between papers and patents might be caused by the fact that scholars in RIST rarely write papers, and we did not restrict our dataset to research conducted on duty. In other words, research institutes have preferences for certain types and topics of research outcomes, and the preferences affect the research of scholars in the institutes. In conclusion, we can say that characteristics of our affiliations affect our research and research collaborations.

RQ 3: Preferences of Research Institutes for Collaboration Outcomes. Case 3 aims at finding the kinds of publications preferred by the research institutes and forming research groups that are capable of producing the same kinds of research outcomes. Case 3 could not outperform Case 2, which concentrates on previous participants in collaborations between the institutes. However, the performance gaps between them were not significant, and Case 3 had a higher recall than Case 2 in predicting collaborated papers. These results suggest that the styles of expected publications are as significant as the characteristics of scholars for predicting interinstitutional collaborations. Also, the effectiveness of expected publications means that research institutes have preferences for certain kinds of research outcomes, although we do not know what exactly determines the "kinds." We can also see this point by comparing the accuracy for collaborated papers with that for collaborated patents. Case 2 and Case 3 outperformed Case 1 in predicting collaborations that produced patents, while they showed the opposite results in predicting collaborated papers. Case 1 had the advantage of learning from more diverse scholars, publications, and venues, while Case 2 and Case 3 restricted the ranges of training data. Thus, we can assume that papers are more diverse than patents. Case 1 and Case 2 showed lower recall than precision in predicting collaborated papers. For Case 2, the collaborated papers from 2011 to 2015 might not be enough to represent collaborations between POSTECH and RIST. However, the different results of Case 1 and Case All are difficult to explain. We cautiously conjecture that their collaborations on papers changed between the two time periods (2011 to 2015 and 2016 to 2020). To understand these results more clearly, we should conduct experiments with more institutes and publication types in further research. Differences in the purposes of POSTECH and RIST could worsen the issue.
We interviewed staff of the technology licensing office of RIST to find the differences. According to the staff, RIST concentrates on filing patent applications for its research outcomes and restricts its members' academic papers. In contrast, POSTECH is a research-oriented university, and it rarely intervenes in the dissemination of research outcomes. Thus, the paper dataset was not enough to represent scholars in RIST; only 62 papers were written by members of RIST during the last ten years, while 2,862 and 2,762 patent applications were published by RIST and POSTECH, respectively, during the same period.
In conclusion, institutes expected particular styles of publications from their collaborations, and this expectation was effective for interinstitutional research team formation. Also, there were significant differences between types of publications (e.g., journal articles, patents, and books), and research institutes occasionally had preferences for publication types. However, we should construct a unified bibliographic network that includes more research institutes and various academic publications to validate RQ 3 more clearly.

Conclusion
This study addresses interinstitutional research team formation. The existing methods for composing research teams rarely considered characteristics of research institutes, although the institutes have individual research interests and aims. We have proposed methods for extracting features of both research institutes and scholars and methods for composing interinstitutional teams based on both sides' characteristics. First, we extended the conventional bibliographic network to represent research institutes' characteristics and embedded the network. Based on vector representations of scholars and publications, we have proposed three methods for predicting collaboration probabilities between scholars in target institutes. The three methods have different ranges of training data: (i) all the previous collaborations, (ii) collaborations between target institutes, and (iii) publications preferred by the target institutes.
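
The three training ranges can be pictured as three filters over the same set of embedded publications. The Python sketch below is only an illustration under stated assumptions: the `Publication` record, the `select_training` function, the mode names, the toy 2-D vectors, and in particular the cosine-similarity threshold used to approximate "publications preferred by the target institutes" are hypothetical and are not the selection procedure used in the experiments.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Publication:
    pub_id: str
    institutes: frozenset  # affiliations of all co-authors
    vector: tuple          # embedding of the publication (toy 2-D here)


def cosine(u, v):
    """Cosine similarity of two equally sized vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_training(pubs, targets, mode, threshold=0.9):
    """Pick training publications for one of the three ranges."""
    # Publications co-authored by members of every target institute.
    joint = [p for p in pubs if targets <= p.institutes]
    if mode == "all":                     # (i) all previous collaborations
        return list(pubs)
    if mode == "between_targets":         # (ii) collaborations between targets
        return joint
    if mode == "preferred_publications":  # (iii) ASSUMPTION: similar to joint work
        return [p for p in pubs
                if any(cosine(p.vector, j.vector) >= threshold for j in joint)]
    raise ValueError(f"unknown mode: {mode}")


# Hypothetical toy data.
targets = frozenset({"POSTECH", "RIST"})
pubs = [
    Publication("a", frozenset({"POSTECH", "RIST"}), (1.0, 0.0)),
    Publication("b", frozenset({"POSTECH"}), (0.9, 0.1)),
    Publication("c", frozenset({"RIST", "OTHER"}), (0.0, 1.0)),
]
modes = ("all", "between_targets", "preferred_publications")
sizes = {m: len(select_training(pubs, targets, m)) for m in modes}
print(sizes)
```

The narrower the range, the fewer training samples remain, which is consistent with the trade-off discussed above between capturing institute-specific characteristics and retaining enough samples to generalize.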
We evaluated the three prediction methods and validated our assumptions by learning the collaborations between POSTECH and RIST from 2011 to 2015 and predicting their collaborations from 2016 to 2020. From the experimental results, we found that interinstitutional research collaborations have distinct characteristics compared to other types of collaborations. Also, as we expected, publications of scholars were affected by their affiliations, and this influence was clearly correlated with the collaborations of the scholars. Lastly, some institutes had preferences for particular types of publications. These correlations and preferences were helpful for predicting future collaborations.
Despite the reasonable accuracy of the proposed methods, they have also shown several limitations as follows: (i) Scale of dataset: We conducted experiments for only two institutes, and we could not integrate the bibliographic data for papers and patents due to the author name disambiguation problem. We should construct a unified bibliographic network that includes more research institutes and various academic publications to validate RQ 3 more clearly. Also, our experiment for predicting collaborated papers could not be generalized enough due to the lack of Scopus-indexed papers published by RIST. Thus, we should diversify our data sources, for example, by collecting domestic journals and conferences. Considering more research institutes can also mitigate this problem. (ii) Collaboration prediction methods: To predict interinstitutional collaborations, we simply used the conventional MLP model. Our assumptions were applied only to adjusting the ranges of training and testing data. Although this approach achieved reasonable accuracy and was enough to validate the assumptions, the accuracy can be improved by employing more sophisticated team formation methods. Also, we will attempt to combine the assumptions with prediction models. (iii) Content of academic publications: We supposed that publications' venues and authors imply the publications' content. However, this approach could not be as accurate as analyzing the content directly. Also, in the case of patents, their venues are the patent offices of each nation. Thus, their venues can be correlated with their impact but not with research domains. This point will be the same for technical reports and preprints. In further research, we will attempt to combine content analysis for academic publications with the proposed team formation methods.

Data Availability
The bibliographic data used to support the findings of this study were supplied by RIST (Research Institute of Industrial Science and Technology) under license and so cannot be made freely available. Requests for access to these data should be made to RIST (http://www.rist.re.kr).

Conflicts of Interest
The authors declare no conflicts of interest.