Evolution of Software Development Effort and Cost Estimation Techniques: Five Decades Study Using Automated Text Mining Approach

Software development effort and cost estimation (SDECE) is one of the most important tasks in software engineering. A large number of research papers have been published on this topic in the last five decades. Investigating research trends through a systematic literature review of such a large body of work is tedious and time-consuming. Therefore, in this paper, we propose a generic automated text-mining framework that investigates research trends by analyzing the titles, authors' keywords, and abstracts of research papers. We apply the framework to 1015 selected research papers published on SDECE in the last five decades. We identify the most popular SDECE techniques in each decade to understand how SDECE has evolved over that period. We find that artificial neural networks, fuzzy logic, regression, analogy-based approaches, and COCOMO are the most used techniques for SDECE, followed by optimization, use case points, machine learning, and function point analysis. NASA and ISBSG are the most used datasets for SDECE. MMRE, MRE, and PRED are the most used accuracy measures for SDECE. We validated the results of the proposed framework by comparing them with the outcomes of previously published review work and found them to be consistent. We have also carried out a detailed bibliometric analysis and a metareview of the review and survey papers published on SDECE. This study is significant for the development of new models for cost and effort estimation.


Introduction
Software development effort and cost estimation (SDECE) is the process of estimating the effort and cost required for software development, and it is one of the most important activities in software engineering.
Several research papers exist on this topic. Some discuss software development effort estimation (SDEE) [1][2][3][4][5][6][7][8][9], and others discuss software development cost estimation (SDCE) [10][11][12][13][14][15]. The terms "software effort estimation" and "software cost estimation" are commonly used interchangeably in the literature. However, software cost estimation is an outcome of software effort estimation [16]. The ability of sales consultants, presales consultants, project managers, delivery managers, and delivery heads to determine accurate costs depends on the level of detail and care taken in estimating effort. Accurate effort and cost estimation plays an important role in the success of a software project. Over the past five decades, there has been a significant increase in the complexity of software projects. This has led to the design and implementation of numerous techniques for estimating the effort and cost of software development, and to their discussion in the literature.
Thus, there exist several studies on SDEE and SDCE, as well as systematic reviews of each. However, there is a lack of research that analyzes the research trends and techniques that have evolved in the last five decades. There is also a need for a systematic bibliometric analysis of the articles published on SDECE in that period. Analyzing such a vast number of research papers is a very tedious and time-consuming task. Considering this, in this paper we propose a generic text-mining-based framework to analyze a large collection of articles and investigate the research trends and techniques used for SDECE in the last five decades. The framework is based on natural language processing and is automated. The advantage of a text-mining approach is that it significantly reduces the manual effort required to investigate research trends and patterns in a corpus of documents on a specific topic such as SDECE [55,56].
This motivated us to conduct this study based on text mining. The proposed framework is generic and can be used in any other domain where a large number of research articles are published and need to be investigated in a similar manner. In this study, we analyzed 1015 research articles indexed in the Scopus database. The objectives, research questions, and contributions of this study are as follows.

Research Objectives
(1) To propose a generic automated text-mining framework for analyzing a large number of research papers in order to identify changing research trends in technologies, methodologies, frameworks, tools, and techniques in an identified area or topic of any scientific or social science field. In this paper, the framework is used to understand how SDECE techniques have evolved in the last five decades.
(2) To investigate the frequently used techniques, accuracy measures, and datasets for SDECE using the proposed text-mining framework
(3) To validate the proposed framework to ensure consistent outcomes
(4) To carry out a systematic bibliometric analysis of studies on SDECE
(5) To carry out a comprehensive metareview of the review and survey papers on SDECE

The paper is organized as follows: in the second section, we present the research method. The third section presents a metareview of the review and survey papers. The results of the automated text-mining framework and the bibliometric analysis are presented in the fourth section. In the fifth section, we validate the results of the proposed text-mining framework. The threats to the validity of the study are explained in the sixth section. Finally, we conclude the paper in the seventh section.

Research Method
To achieve the stated objectives of this study, we analyzed 1015 selected articles from the Scopus database. We had three indexing databases from which to select articles on SDECE: Scopus, Web of Science, and Google Scholar. These are the three online indexing databases most widely used by researchers. We decided to use Scopus because it allowed us to download all the required data about the research articles in CSV file format, including the title, year of publication, number of citations, source of the article (journal, conference, etc.), authors' keywords, abstract, document type, and author information. The search string used to find documents in Scopus was chosen by taking into account the objectives and research questions of the study. The search terms used were "software effort estimation" OR "software cost estimation". We used this search string because we needed to limit the study to research papers that discuss software effort estimation and software cost estimation. The search was performed on 23 May 2020. We exported the search results in CSV (Excel) file format. In the title column of the extracted data, we found that some entries were not research papers but the titles of conferences, symposiums, and workshops, so we removed them from our list. We removed the following types of titles from the search results: (i) 48 conference titles; (ii) 6 symposium titles; (iii) 2 annual convention titles; (iv) 6 international workshop titles; (v) 1 conference review title; and (vi) 1 conference note. We also found 26 non-English papers in the search results and removed them from the list. Thus, in total we removed 90 entries from the original search results and selected 1015 papers for this study.
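The selection step described above can be sketched in a few lines. This is a minimal illustration in Python, not the authors' procedure: the paper describes removing non-paper titles by inspection, whereas the keyword heuristic below is an assumption, as are the column names (modeled on a Scopus-style CSV export) and the four sample rows. The removal counts are the ones reported in the text.

```python
import csv
import io

# Hypothetical miniature export; column names and rows are assumptions for illustration.
SAMPLE_CSV = """Title,Year,Language of Original Document,Document Type
Software cost estimation with fuzzy logic,2005,English,Article
Proceedings of the 2010 International Conference on Software Engineering,2010,English,Conference Review
Estimation de l'effort logiciel,2012,French,Article
A neural network approach to software effort estimation,2015,English,Conference Paper
"""

# Assumed heuristic: substrings that mark an entry as an event title, not a paper.
NON_PAPER_MARKERS = ("proceedings", "conference on", "symposium", "workshop", "annual convention")


def is_research_paper(row):
    """Return True unless the title looks like a conference/symposium/workshop title."""
    title = row["Title"].lower()
    return not any(marker in title for marker in NON_PAPER_MARKERS)


def is_english(row):
    """Return True when the record is marked as written in English."""
    return row["Language of Original Document"].strip().lower() == "english"


def select_papers(csv_text):
    """Keep only English-language research papers from a CSV export."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return [row for row in rows if is_research_paper(row) and is_english(row)]


selected = select_papers(SAMPLE_CSV)

# Removal counts as reported in the text; their sum should equal the stated total of 90.
removed_counts = {
    "conference titles": 48,
    "symposium titles": 6,
    "annual convention titles": 2,
    "international workshop titles": 6,
    "conference review title": 1,
    "conference note": 1,
    "non-English papers": 26,
}
total_removed = sum(removed_counts.values())
```

A keyword filter of this kind can misclassify a genuine paper whose title contains a word such as "workshop", so its output would still need a manual check like the one the authors describe.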
Later, by reading the titles of the research papers, we checked whether all 1015 selected papers were on SDECE and found that all of them were relevant. Thus, other than the criteria that a paper should be on SDECE and written in English, we did not use any exclusion criteria. The analysis of research trends and techniques used for SDECE was done separately for the titles, keywords, and abstracts of the research papers, because we wanted to check whether the results based on each of them are consistent. We used the "wordcloud" and "tm" packages in the R programming language for the text-mining task. The bibliometric analysis of the 1015 papers was done using the following information: title, authors, year of publication, source title, cited by, affiliation, and document type. The detailed results of both the bibliometric analysis and the text mining are presented in Section 4.

Metareview of Review and Survey Papers on SDECE and Related Work
Several studies have been published on SDECE in the last five decades. In this section, we present a detailed metareview of the review and survey papers. Among the 1015 selected articles, we found 39 review/survey papers: 13 journal articles, 25 conference papers, and one book chapter. Of these 39 papers, 9 that were published in 2018 and 2019 had not received any citations as of May 2020. The remaining 30 papers received a total of 1636 citations. The main findings of the 39 review/survey papers are presented in Table 1. For some studies, data such as the duration of the study and the number of papers reviewed were not available, so we could not include those details in Table 1. Some existing studies have used a text-mining approach to identify research trends in different areas. Garousi and Mäntylä [55] used text mining to identify research themes and hot and cold topics in software engineering. Nie and Sun [85] used text mining to identify major academic branches and research trends in design research. Sehra et al. [56] used a text-mining approach to identify research patterns and trends in software effort estimation, applying it to articles published between 1996 and 2016. All these studies found that text mining is an adequate choice when a large number of articles need to be assessed to understand the research trends, research themes, and hot and cold topics in an identified research area.
However, we found that there is a lack of research on (i) investigating research trends in SDECE in the last five decades; (ii) identifying the most popular SDECE techniques in each decade to understand how SDECE techniques have evolved in the last five decades; (iii) investigating research trends by analyzing the titles, keywords, and abstracts of research papers separately to understand whether the results are consistent or different; (iv) a metareview of the review and survey papers published on SDECE in the last five decades; and (v) a bibliometric analysis of the papers published in the last five decades. Therefore, in this study, we attempt to fill these gaps.
Based on a metareview of the review and survey papers, we have identified the most used (i) SDECE techniques, (ii) datasets, and (iii) accuracy measures for SDECE. The most used SDECE techniques are shown in Figure 1. The most used datasets and accuracy measures are given in Table 2.

Table 1. Research paper: Findings

Journal papers
[17] This study classifies cost estimation models into five categories and explains each category in detail: (i) model-based approaches: SLIM, COCOMO, Checkpoint, SEER; (ii) expertise-based models: Delphi, rule-based; (iii) learning-based models: ANN, robust; (iv) regression models: OLS and robust; and (v) composite models: Bayesian and COCOMO II.
[10] The study reviewed 304 papers from 76 journals. Research papers published before April 2004 were included by manual search. The review classified papers by research topic, research approach, SDEE technique, and dataset used. The study also listed important cost estimation journals, research topics, research approaches, estimation approaches, and study contexts.
[16] The study reviewed 84 articles published between 1991 and 2010. Four aspects of ML models were reviewed: the ML technique, the estimation accuracy of the ML technique, comparison of ML models, and estimation context. The study found that ML models are more accurate than non-ML models/techniques.
[37] Conducted a systematic empirical analysis of 10 local and global models of SEE. The study found that the results differ between local and global methods of SDEE because of different experiment designs and datasets.
[41] Reviewed 21 articles describing neural-network-based models for SEE. The study reports the range of features used for SDEE with ANN. The important findings are as follows: (i) ANN gives better results than regression, the classic COCOMO model, SLIM, and FPA; (ii) most researchers used the COCOMO dataset; (iii) the most used accuracy measures are MMRE, MdMRE, MRE, PRED, and MMER; (iv) the most used neural network is the feed-forward neural network.
[57] The study reviewed 129 articles published between 2000 and 2014 and discussed the usefulness and limitations of the ISBSG dataset for SEE. About 70% of the papers used the ISBSG dataset for SDEE, and 36% used it to study its properties. 55% of the papers used the ISBSG dataset, and the others used complementary datasets for SEE. The study also highlighted that the most common methods used for SDEE are regression, machine learning, and estimation by analogy.
[58] The review period was 1991 to 2016. The study reported that several estimation techniques have evolved because of the changing nature and complexity of software development. It also reported that several data mining and machine learning techniques are used alongside conventional SEE methods to improve results.
[59] The study reviewed 101 articles published between 2006 and 2015 on cost estimation in agile software development. It reported the most popular SDEE techniques, accuracy measures, and project success rates over the years. ANN and expert judgment are the most used techniques for agile SDECE. MRE, MMRE, MdMRE, and PRED are the most used accuracy measures.
[60] The review period was 2000 to 2017. The articles were reviewed with respect to the type of soft computing or machine learning technique used for SEE. The study reported that COCOMO, NASA, ISBSG, and Desharnais are the most used datasets and that MMRE and PRED are the most used evaluation metrics. It also reported that ANN is the most used estimation technique.
[61] The study analyzed 20 papers on SDCE tools. The review concluded that most of the tools are based on the COCOMO model.
[3] The study reviewed models built using ML techniques for SEE, covering 75 papers published between 1991 and 2017. The study found that (i) ANN is a widely used ML technique; (ii) MMRE is a widely used accuracy measure; (iii) ANN and SVM outperformed the other techniques; (iv) regression is a non-ML technique widely used for effort estimation.
[62] The study reviewed 74 articles published between 2000 and 2017. Eight types of techniques were found to be used for SEE. The study found that (i) the most used datasets are ISBSG, COCOMO, NASA93, NASA, Desharnais, Albrecht, SDR, China, Kemerer, Miyazaki, Maxwell, and Finnish; (ii) the most widely used methods are ANN, CBR, linear regression, fuzzy logic, GA, kNN, support vector regression, logistic regression, and decision trees; (iii) the most used accuracy measures are MMRE, MdMRE, and PRED.
[63] Discussed issues in estimating the cost of software projects.

Conference papers
[64] The paper reports survey results on the SDEE techniques used at the JPL laboratory. It found that (i) most technical staff use informal analogy and high-level partitioning of requirements, and (ii) staff were better at estimating effort than size.
[5] The findings are based on surveys on SDEE and are as follows: (i) 60-70% of projects encounter effort or schedule overruns; (ii) 30-40% of projects encounter cost overruns; (iii) a frequently used estimation method is expert judgment.
[65] The study analyzed 112 projects from the Chinese software industry. The survey investigated estimation methods, the accuracy of the methods, and the factors influencing the adoption of a given method. The main findings are as follows: (i) large projects are prone to cost and schedule overruns, and (ii) about 15% of organizations used model-based methods.
[66] The paper provides a comprehensive overview of analogy-based SEE. It also discusses analogy-based tools and systems, dataset quality, and its relevance in predicting SEE.
[67] This study reviews three parametric models used for SDEE: SLIM (Putnam, 1979), SEER-SEM (1989), and SPR KnowledgePlan (1999).
[68] This study reports the results of a survey on SDEE from an industry perspective, such as the ability of software organizations to apply SDEE techniques and to actually use them for effort estimation. The study also reports the requirements for SDEE identified from the survey and compares them with the requirements of existing methods.
[69] This study reports cost and schedule estimation approaches for component-based software development. The published work is analyzed with respect to modeling techniques, data requirements, type of estimation, and lifecycle activities.
[70] This survey reports on the reliability of expert judgment for SDCE in a medium-sized software company. The study reported that cost estimation based on expert judgment is unreliable.
[71] The study gives an overview of ANN for SDEE, its usefulness, and its accuracy.
[72] The study reviewed 19 articles published between 2000 and 2014. The focus of the review was to determine whether the use of feature weighting techniques (FWT) in CBR improves SDEE prediction accuracy. The study concluded that it does.
[73] The study reviewed articles pertaining to SDEE and concluded that every technique has its own advantages and disadvantages and that there is no single globally accepted technique for SEE.
[74] The study reviewed 167 papers published between 2000 and 2013. It reports statistics on the usage of variables in the ISBSG dataset for SEE. The study found that variables with missing values are used less frequently.
[75] The study reviewed only 16 articles. The reviewed articles were classified using nine criteria for global software development (GSD). It found that the dominant contribution of GSD research was the models and software development cost.
[76] The study reviewed articles on the tools and frameworks developed for SDEE using the use case point model.
[59] This study reviewed various soft computing techniques, such as genetic algorithms, neural networks, fuzzy systems, and particle swarm optimization, used for SDEE in agile software development. The study found that soft computing techniques provide better estimation accuracy.
[77] The study reviewed 15 articles. The findings are as follows: (i) companies use expert judgment for SDEE; (ii) algorithms and prediction techniques need to be improved.
[78] The study reports 8 common approaches used to find the k value for analogy-based SDEE techniques. It reports that the varied performance of the different approaches in finding k values led to conflicting results.
[79] The study reviewed articles to find which estimation method is best. It found that the use case point analysis approach is better than function point analysis and the COCOMO model.
[80] The study reviewed 10 articles. The survey analyzed the contribution of the papers to effort estimation with respect to time, cost, and test. The study found that supervised learning algorithms are the most popular for effort estimation.
[81] The study reviewed 41 papers on ML-based SDEE published between 2000 and 2017. It discussed ML techniques, size metrics, benchmark datasets, and validation methods for SEE. It found the following: (i) most used techniques: fuzzy logic, ANN, GA, analogy-based, SVR, Bayesian network, regression tree, CBR; (ii) datasets used: NASA, ISBSG, Albrecht, COCOMO, Desharnais, Kemerer, kotengray, Maxwell; (iii) performance measures: MRE, MMRE, PRED, MdMRE, MMER, MSE, RMSE, standard deviation.
[82] The review was conducted to understand the importance of nonfunctional requirements in SDEE. The study identified the nonfunctional requirements used in SDEE and how they are used. It also found that the use of nonfunctional requirements in SDEE brings down error by 30%.
[83] The study reviews cost estimation techniques and presents the strengths and weaknesses of the techniques.
[4] The study reviewed use-case-based effort estimation methods and provides the factors contributing to use case effort estimation. It also provides input on criteria for evaluating the accuracy and effectiveness of the models.
[3] The study reviewed 30 articles on "bio-inspired feature selection algorithms" published between 2007 and 2018. It found that the genetic algorithm (GA) and particle swarm optimization (PSO) are widely used bio-inspired algorithms. The results of GA and PSO are better than those of baseline estimation techniques.
[21] The study discussed the limitations and accuracy of the function point analysis method.
[84] The study (i) reviewed papers that describe models, processes, and practices and (ii) proposed a general prediction process and framework for selecting predictive measures.

data; (ii) feature engineering is not required. The weaknesses are as follows: (i) a large amount of data is required for training, so it is computationally expensive; (ii) it is difficult to understand the reasoning behind the results, so interpreting the results is difficult; (iii) it may suffer from over-fitting; (iv) it cannot deal with missing values; (v) categorical values need to be converted to numeric type.
(C) Analogy-based approaches: The strengths are as follows: (i) it is easy to understand the reasoning behind the outcome; (ii) it can deal with outliers. The weaknesses are as follows: (i) it is computationally intensive; (ii) it is sensitive to the similarity function; (iii) categorical variables need to be converted to numeric type; (iv) it cannot handle missing values; (v) it is difficult to obtain a solution if similar work has not been done in the past.
(D) Fuzzy logic: The strengths are as follows: (i) it is based on the theory of classes with soft boundaries, so it can deal with uncertainty in the data caused by measurement error during data collection; (ii) it can also deal with uncertainty in the model; (iii) it gives improved performance when combined with ML or non-ML models; (iv) it is similar to the human reasoning process. Its only weakness is that it becomes computationally intensive when combined with ML or non-ML models.
The strength is that it is very useful for project planning, control, and budgeting. Its weakness is that it is based on calibration of past experience, so estimation becomes difficult in unprecedented situations.
(H) Expertise-based estimation: Its strength is that it is very useful when no quantifiable or empirical data is available. Its weakness is that it is based purely on the knowledge and experience of the expert, so the estimate is just an opinion; it can be biased and may be wrong.

The Generic Automated Text-Mining Framework and Bibliometric Analysis
This section is divided into two parts: the first part explains the generic automated text-mining framework used to study the evolution of SDECE in the last five decades, and the second part presents a bibliometric analysis of the 1015 selected research papers.

The Generic Automated Text-Mining Framework for Identifying Research Trends and Patterns.
In this section, we present the generic automated text-mining framework and use it to investigate the research trends and techniques used for SDECE by analyzing the titles, abstracts, and authors' keywords of the 1015 selected research papers published in the last five decades. The framework is shown diagrammatically in Figure 2. Text mining is applied to the (i) titles, (ii) abstracts, and (iii) authors' keywords of the research papers. We used the "tm," "RWeka," and "wordcloud" packages in R. The steps used for text mining are as follows:
Step 1: The titles of the research papers were loaded into R from the CSV file downloaded from the Scopus database.
Step 2: We created a corpus of documents, in which each title is treated as a separate document.
Step 3: We cleaned the text by performing the following tasks:
(i) Converting text to lower case
(ii) Removing punctuation, whitespace, numbers, and special characters from the text
(iii) Removing stopwords. Stopwords are words that occur very frequently in a document, such as "the," "this," and "and," but do not help in extracting meaningful insights from the text data.
Step 4: We created tokens of the words and computed their frequencies using the "NGramTokenizer" function in the "RWeka" package in R.
Step 5: We created a WordCloud from the word-frequency table produced in Step 4.
We repeated the above process for the abstracts and authors' keywords of the 1015 selected papers. We also performed a decade-wise analysis using the titles, abstracts, and keywords of the research papers published in each decade.
We stored the word-frequency table from Step 4 in CSV file format so that we could cross-check the WordCloud against the word-frequency table.
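The steps above can be sketched compactly. The study itself used the R packages "tm," "RWeka," and "wordcloud"; the following is a substitute illustration in Python using only the standard library, not the authors' code. The stopword list, the three sample titles, and all function names are hypothetical, and rendering the WordCloud (Step 5) is omitted.

```python
import csv
import io
import re
from collections import Counter

# Small illustrative stopword list (an assumption); real stopword lists are much longer.
STOPWORDS = {"the", "this", "and", "a", "an", "of", "for", "in", "on", "to", "using", "with"}


def clean(text):
    """Step 3: lower-case, drop punctuation/digits/special characters, remove stopwords."""
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    return [token for token in text.split() if token not in STOPWORDS]


def ngrams(tokens, n):
    """Step 4 (tokenizing): build word n-grams from a cleaned token list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def term_frequencies(documents, n_values=(1, 2)):
    """Step 4 (counting): frequency table over a corpus, one document per title."""
    counts = Counter()
    for document in documents:
        tokens = clean(document)
        for n in n_values:
            counts.update(ngrams(tokens, n))
    return counts


def to_csv(counts):
    """Serialize the frequency table as CSV text so it can be cross-checked later."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["term", "frequency"])
    writer.writerows(counts.most_common())
    return buffer.getvalue()


# Steps 1-2: a hypothetical corpus of three titles standing in for the Scopus export.
titles = [
    "Software Effort Estimation Using Fuzzy Logic",
    "A Fuzzy Logic Approach to Software Cost Estimation",
    "Neural Network Models for Software Effort Estimation",
]
freq = term_frequencies(titles)
table = to_csv(freq)
```

Because stopwords are removed before the n-grams are built, a bigram can span a removed word (for example "estimation fuzzy" in the first title), which is one reason a stored frequency table is useful for cross-checking what a word cloud displays.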
We address the following research questions using the proposed framework:
Table 3. The first column in the table shows the WordCloud based on titles, the second column shows the WordCloud based on authors' keywords, and the third column shows the WordCloud based on the abstracts of the research papers. The prominent words in each WordCloud are listed just below it for better understanding. These prominent words indicate the most commonly used, referred, or discussed techniques for SDECE. The first row in Table 3 shows the WordClouds of the papers published between 1971 and May 2020. The other rows in Table 3 show the WordClouds of the research papers published in each decade.
From Table 3, it is observed that the five most common techniques used for SDECE in the last five decades (1971 to May 2020) are fuzzy logic, artificial neural networks, regression, the analogy-based approach, and the COCOMO model. The results also show that the other commonly used techniques are optimization, use case points, function point analysis, machine learning, COCOMO II, and CBR. There is a small variation in the most common techniques identified from the titles, keywords, and abstracts of the research papers. An SDECE technique mentioned in the title of a paper generally indicates that the technique is proposed or used in that paper, whereas a technique listed in the authors' keywords or abstract may be proposed, used, referred to, discussed, or compared with other existing techniques. Therefore, we strongly believe that title-based text mining gives information about the techniques proposed or used by researchers for SDECE, whereas keyword-based and abstract-based text mining gives information about the techniques most often discussed, proposed, used, referred to, or compared with other techniques. However, the top five techniques are the same whether the text analysis is based on the titles, authors' keywords, or abstracts of the research papers.

Evolution of SDECE Techniques.
The text-mining results based on the titles of the research papers published between 2011 and May 2020 show that ANN, the fuzzy approach, optimization, COCOMO, and regression are the most used techniques. The text-mining results based on the keywords and abstracts of the research papers show that COCOMO is the most discussed technique. The other widely used techniques are the analogy-based approach, use case points, function point analysis, machine learning, COCOMO II, and GA. The text-mining results for the period between 2001 and 2010 show that the fuzzy approach, ANN, regression, analogy-based approaches, and COCOMO are the most used techniques. Regression is the most discussed technique based on text analysis of the abstracts. The title-based text-mining results show that the fuzzy-based approach is the most used technique, followed by ANN, regression, analogy, and COCOMO. The other widely used techniques during this period are clustering, function point analysis, soft computing, GP, machine learning, use case points, and COCOMO II. The text-mining results for the period between 1991 and 2000 show that COCOMO and function point analysis are the most used techniques, followed by analogy, CBR, ANN, regression, fuzzy logic, and ML techniques. Because our study contains very few selected research papers (21) for the period between 1981 and 1990, the WordCloud for that period does not have a large set of words. However, the results show that COCOMO was the most popular method during this period. We did not apply text mining to the research papers published between 1971 and 1980 because only two of the 1015 selected papers fall in that period. The number of research papers published during that period may be larger, but only two appear in our set of selected papers.
Thus, the text-mining results show that (i) in the initial period between 1981 and 1990, the focus of the research was on calibration and productivity, and COCOMO was the most used technique; (ii) between 1991 and 2000, COCOMO became more popular, and researchers also proposed function point analysis, regression, analogy-based approaches, CBR, and fuzzy-based techniques; (iii) the fuzzy approach, ANN, and regression-based approaches became more popular between 2001 and 2010, followed by COCOMO, analogy-based approaches, clustering, and ML techniques; (iv) between 2011 and 2020, the ANN-based approach was the most popular, followed by fuzzy logic, optimization, COCOMO, regression, and the analogy-based approach, with use case points, function point analysis, and machine learning as other popular techniques; (v) over the last fifty years, i.e., between 1971 and May 2020, the most popular techniques in the studied research papers are fuzzy logic, ANN, regression, the analogy-based approach, and COCOMO, followed by optimization, use case points, function points, ML, COCOMO II, clustering, and CBR-based approaches.
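The decade-wise analysis summarized above amounts to bucketing papers by decade and counting technique mentions within each bucket. The following Python sketch illustrates that idea under stated assumptions: the decade boundaries follow the paper's periods (1971-1980, 1981-1990, and so on), while the pattern-to-technique mapping and the sample papers are hypothetical and far smaller than the real corpus.

```python
from collections import Counter, defaultdict

# Hypothetical mapping from a lower-case surface pattern to a technique label.
TECHNIQUES = {
    "fuzzy": "fuzzy logic",
    "neural network": "ANN",
    "regression": "regression",
    "analogy": "analogy-based",
    "cocomo": "COCOMO",
}


def decade_label(year):
    """Map a year to the decade label used in the text, e.g. 1990 -> '1981-1990'."""
    start = (year - 1) // 10 * 10 + 1
    return f"{start}-{start + 9}"


def techniques_by_decade(papers):
    """Count technique mentions per decade from (year, title) pairs."""
    result = defaultdict(Counter)
    for year, title in papers:
        lowered = title.lower()
        for pattern, name in TECHNIQUES.items():
            if pattern in lowered:
                result[decade_label(year)][name] += 1
    return result


# Hypothetical sample records for illustration only.
papers = [
    (1985, "Calibrating COCOMO for a new environment"),
    (2005, "Fuzzy logic for software cost estimation"),
    (2008, "Neural network versus regression for effort estimation"),
    (2015, "A neural network model for agile effort estimation"),
]
by_decade = techniques_by_decade(papers)
```

Substring matching is deliberately crude here: the pattern "cocomo" would also count titles that mention COCOMO II, which the paper reports as a separate technique.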
We can map the evolution of SDECE techniques to the evolution of programming languages. In the initial period (1970 to 1990), COCOMO was the most popular model because software systems were being developed using assembly and procedure-oriented programming languages, and the COCOMO model is based on the number of lines of code written to develop the software system.
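To make the dependence on lines of code concrete, the sketch below implements the Basic COCOMO effort equation, effort = a * KLOC^b in person-months, with Boehm's published coefficients for the three project modes. The paper itself does not state the formula, so this is background illustration rather than part of the study; the function and variable names are hypothetical.

```python
# Basic COCOMO coefficients (a, b) per project mode, as published by Boehm (1981).
COEFFICIENTS = {
    "organic": (2.4, 1.05),
    "semi-detached": (3.0, 1.12),
    "embedded": (3.6, 1.20),
}


def basic_cocomo_effort(kloc, mode="organic"):
    """Estimate effort in person-months from size in thousands of lines of code."""
    a, b = COEFFICIENTS[mode]
    return a * kloc ** b


# Example: a hypothetical 10 KLOC project estimated under each mode.
estimates = {mode: basic_cocomo_effort(10, mode) for mode in COEFFICIENTS}
```

Because size in lines of code is the only input, the estimate is only as good as the size estimate, which fits the text's observation that the model suited an era of assembly and procedural code.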
In the later stage (1991-2000), function point analysis, regression, and analogy-based approaches became more popular because by that time a large amount of software project data had been recorded and was available for estimating newly developed software systems. Regression is a statistical technique that estimates effort and cost from historical software project data, whereas the analogy-based approach estimates effort by considering the effort required for similar systems or projects developed in the past. Function point analysis also became more popular because software systems were being developed using functional programming languages.
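The difference between the two data-driven approaches just described can be shown with a small sketch. It is an illustration under simplifying assumptions, not a method from the reviewed papers: each project is described by a single size attribute, regression is ordinary least squares on that attribute, analogy is the mean effort of the k closest past projects, and the historical data are hypothetical.

```python
def fit_linear(sizes, efforts):
    """Ordinary least squares for effort = intercept + slope * size."""
    n = len(sizes)
    mean_x = sum(sizes) / n
    mean_y = sum(efforts) / n
    sxx = sum((x - mean_x) ** 2 for x in sizes)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, efforts))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def estimate_by_regression(sizes, efforts, new_size):
    """Predict effort for a new project from the fitted regression line."""
    intercept, slope = fit_linear(sizes, efforts)
    return intercept + slope * new_size


def estimate_by_analogy(sizes, efforts, new_size, k=2):
    """Predict effort as the mean effort of the k most similar past projects."""
    ranked = sorted(zip(sizes, efforts), key=lambda project: abs(project[0] - new_size))
    nearest = ranked[:k]
    return sum(effort for _, effort in nearest) / len(nearest)


# Hypothetical historical projects: size (e.g. function points) and effort (person-months).
sizes = [100, 200, 300, 400]
efforts = [10, 20, 30, 40]

by_regression = estimate_by_regression(sizes, efforts, 260)
by_analogy = estimate_by_analogy(sizes, efforts, 260, k=2)
```

Regression uses every historical project to fit one global trend, while analogy uses only the most similar projects, so the two can disagree for the same new project even on the same data.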
Later, during the period between 2001 and 2010, fuzzy logic and ANN techniques became more popular. Fuzzy logic was popular because it takes into account vague and imprecise information, and ANN was popular because some researchers were of the opinion that ANN gives more accurate estimates than the existing techniques. During the same period, due to the emergence of machine learning techniques and the availability of existing project data, researchers also started using different ML techniques for effort and cost estimation. Since researchers started using ML techniques, optimization also became more popular, as it helped in selecting the most appropriate features. Also, as systems were being developed using object-oriented programming languages, scholars started using use case point techniques for SDECE.
For the period between 2010 and 2020, researchers started using existing techniques in combination with other existing techniques for better estimation. In the recent past, scholars have used deep learning techniques for prediction in other domains, but there is a lack of research on using deep learning techniques for SDECE. Therefore, we recommend that scholars explore this option for SDECE.
We have also identified the most common datasets and accuracy measures used for SDECE. As researchers rarely use the dataset name and accuracy measures in the title of the research article, finding the most frequently used datasets and accuracy measures by applying text mining to the titles of the research papers is difficult. However, researchers use the dataset name and accuracy measures in the authors' keywords and abstracts of the research papers. Therefore, we could find the most frequent datasets and accuracy measures by applying text mining to the authors' keywords and the abstracts of the research papers. The most frequently used datasets and accuracy measures for each decade and for the period between 1974 and May 2020 are also given in Table 3. It is observed that (i) NASA and ISBSG are the most used datasets; and (ii) MMRE, MRE, and PRED are the most used accuracy measures for the period from the 1970s to 2020. Using text mining, we have also identified whether the focus of the research was on SDEE or SDCE. The results of the text mining for the same are presented in Table 4. The results show that (i) for the initial period from 1981 to 1990, the focus of the research papers was on cost estimation; (ii) for the periods 1991 to 2000 and 2001 to 2010, the focus was on cost estimation followed by effort estimation; (iii) for the period between 2011 and 2020, more studies discussed effort estimation than cost estimation; (iv) for the past five decades, from 1974 to May 2020, more studies discussed effort estimation than cost estimation. However, it is important to note that some researchers use these two terms interchangeably.

Bibliometric Analysis.
In this section, we present the bibliometric analysis of select 1015 research papers published during the period between 1974 and May 2020 to address the research questions from RQ4 to RQ7.
RQ4: What is the distribution of SDECE papers and their citations by document type? The distribution of the selected papers by document type is given in Table 5.
The contribution of the journal and conference documents is 39.11% and 59.41%, respectively. However, in terms of the number of citations, journal papers received more citations (68.62%) than conference papers (31.04%). The contribution of books and book chapters, in terms of both the number of papers and the number of citations, is very small.
RQ5: How many research papers are published on SDECE each year and each decade since 1970? The distribution of papers published in the last five decades is given in Table 6. It is observed that about 92% of the papers were published in the last two decades. Figure 3 shows the number of papers published in each year since 1974. As we have included papers only up to May 2020, the number of papers published in the year 2020 is lower than in the year 2019.
RQ6: What is the distribution of citations of SDECE papers?
This research question is further divided into five subquestions as follows.
RQ6.1: What is the distribution of journal and conference papers with zero citations and with one or more citations? The number of citations of a research paper plays an important role in determining its influence or impact. Based on the articles selected for this study, the count of citations for journal and conference articles is given in Table 7. It is observed that 22.36% of the papers have received zero citations. The proportion of journal and conference articles having zero citations is 19.89% and 23.21%, respectively.
RQ6.2: What are the highly cited papers? We have identified highly cited papers using the average number of citations received by a paper per year since its publication. The top 5 articles based on the average annual number of citations are shown in Figure 4. The top five articles are all published in journals. It is also found that, of the top 10 articles, all are journal articles except the one at the seventh position.
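The ranking by average annual citations used in RQ6.2 can be sketched as follows. This is an illustrative sketch only: it assumes that the average is the total citation count divided by the number of years since publication, counting the publication year itself, and that the reference year is 2020. The exact convention used in the study is not stated, and the `Paper` records below are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Paper:
    title: str
    year: int
    citations: int


def avg_annual_citations(paper: Paper, current_year: int = 2020) -> float:
    """Citations per year since publication (publication year counted, assumed)."""
    years = max(current_year - paper.year + 1, 1)
    return paper.citations / years


def top_cited(papers, n=5, current_year=2020):
    """Return the n papers with the highest average annual citations."""
    return sorted(
        papers,
        key=lambda p: avg_annual_citations(p, current_year),
        reverse=True,
    )[:n]


# Hypothetical records, for illustration only.
papers = [
    Paper("Paper A", 2000, 420),
    Paper("Paper B", 2015, 180),
    Paper("Paper C", 2010, 110),
]

for p in top_cited(papers, n=2):
    print(p.title, round(avg_annual_citations(p), 2))
```

Normalizing by age in this way lets a recent paper outrank an older paper that has more total citations, which is the motivation for using an annual average rather than the raw count.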
RQ7: Who are the top authors in terms of the number of papers and number of citations? The contribution of authors is measured using two metrics: (i) the number of articles published by the author and (ii) the number of citations received by the author for all of his or her articles selected in this study. The top ten authors based on these two metrics are given in Table 8. We have also created a WordCloud of authors using the authors column of the CSV file of the selected 1015 papers. The resulting WordCloud (Figure 5) of the authors' contribution based on the number of papers matches the manual calculation of the number of papers published by the top authors (refer to Table 8).
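The two author metrics in RQ7 can be computed from a CSV export along the following lines. This is a minimal sketch under stated assumptions: the column names `Authors` and `Cited by` and the semicolon separator between author names are hypothetical and should be adjusted to the actual export, and the sample rows are invented for illustration.

```python
import csv
import io
from collections import Counter


def author_metrics(rows, sep=";"):
    """Return (papers per author, citations per author) from CSV rows."""
    paper_counts, citation_counts = Counter(), Counter()
    for row in rows:
        # A set avoids counting an author twice for the same paper.
        authors = {a.strip() for a in row["Authors"].split(sep) if a.strip()}
        cited = int(row["Cited by"] or 0)  # empty cell treated as zero citations
        for name in authors:
            paper_counts[name] += 1
            citation_counts[name] += cited
    return paper_counts, citation_counts


# Hypothetical CSV content; column names are assumptions.
SAMPLE_CSV = """Authors,Title,Cited by
Author A; Author B,Paper 1,10
Author A,Paper 2,5
Author C; Author B,Paper 3,
"""

rows = csv.DictReader(io.StringIO(SAMPLE_CSV))
paper_counts, citation_counts = author_metrics(rows)
print(paper_counts.most_common(3))
print(citation_counts.most_common(3))
```

The per-author paper counts produced this way are the same frequencies a word cloud of the authors column would visualize, so the two can be cross-checked against each other.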
The bibliometric analysis of the selected 1015 papers shows that (i) the impact of journal papers in terms of the number of citations is greater than that of conference papers, though there are more conference papers than journal papers; (ii) IEEE Transactions on Software Engineering, Information and Software Technology, and Journal of Systems and Software are the top journal sources for research on software development effort and cost estimation; (iii) Jorgensen, Boehm, and Shepperd are the most cited researchers, whereas Idri, Angeles, and Keung have published the maximum number of research papers on SDECE.

Validation of the Framework
In this section, we validate the results of the proposed automated text mining framework by comparing them with (i) the results/outcome of the comprehensive systematic literature reviews (SLRs) done in the past; and (ii) the results obtained manually by reading the titles of all selected 1015 research papers.

Validation Using past SLRs.
A summary of the results based on five selected comprehensive systematic literature reviews conducted in the past is shown diagrammatically in Figure 6.

Validation by Reading Title of the Research Papers.
The results obtained manually by reading the titles of the selected 1015 research papers are shown in Figure 7. The results show that fuzzy logic, analogy-based approach, ANN, regression, and optimization techniques are the most used techniques for SDECE. We did not validate the most used datasets and accuracy measures by reading the titles of the research papers because they are usually not mentioned in the title of a research paper.
Thus, a careful examination of the results obtained manually by reading the titles of the research papers and of the outcome of the five selected SLRs shows that these results are very similar to the results obtained using the proposed text mining approach. Therefore, we strongly believe that the proposed automated text mining framework is very useful for investigating research trends in an identified research area and reduces the time and effort required compared with the systematic literature review method.

Discussions
We have used the following search string to search literature from the Scopus database: "software effort estimation" OR "software cost estimation". The search string was designed by considering the objectives and research questions of the study. As papers are selected only from the Scopus database, [...] [86,87]. Further, we did not apply any exclusion criteria to the search results except that the paper should be written in the English language and the focus of the study should be on SDECE. Therefore, we believe that there was no bias in the paper selection process. We checked the final list of selected papers and found that all the selected papers were relevant to the objectives of the study. We strongly believe that our analysis results would not be much different had we included more papers from Scopus or other indexing databases, because the number of papers selected in the study is 1015, which is sufficiently large to achieve the research objectives. Another threat to the study, with respect to using text mining for investigating the most popular technique in each decade, is that if scholars write the name of a technique in a slightly different way than the usual one, then that technique would be treated as a different one. However, in most manuscripts, the techniques are named/referred to in the same manner, barring a few cases. Therefore, this will not much affect the capture of the overall research trend.
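The naming threat discussed above can be reduced by mapping variant spellings to one canonical name before counting. The following is a minimal sketch of that idea and is not part of the proposed framework: the alias table, the function names, and the sample titles are all hypothetical, and each technique is counted at most once per title.

```python
import re
from collections import Counter

# Hypothetical alias table: variant spelling -> canonical technique name.
ALIASES = {
    "artificial neural network": "ann",
    "artificial neural networks": "ann",
    "neural network": "ann",
    "neural networks": "ann",
    "ann": "ann",
    "case based reasoning": "cbr",
    "cbr": "cbr",
    "fuzzy logic": "fuzzy logic",
    "cocomo": "cocomo",
}


def normalize(text: str) -> str:
    """Lower-case, turn hyphens into spaces, and collapse whitespace."""
    return re.sub(r"\s+", " ", text.lower().replace("-", " ")).strip()


def techniques_in(text: str, aliases=ALIASES) -> set:
    """Return the canonical techniques mentioned in a piece of text."""
    cleaned = normalize(text)
    return {
        canonical
        for variant, canonical in aliases.items()
        if re.search(rf"\b{re.escape(variant)}\b", cleaned)
    }


def count_techniques(titles, aliases=ALIASES) -> Counter:
    """Count, per technique, the number of titles that mention it."""
    counts = Counter()
    for title in titles:
        counts.update(techniques_in(title, aliases))
    return counts


# Hypothetical titles, for illustration only.
titles = [
    "Software effort estimation using artificial neural networks",
    "An ANN-based approach to cost estimation",
    "Case-based reasoning and fuzzy logic for effort prediction",
    "Neural network models for COCOMO calibration",
]

counts = count_techniques(titles)
print(counts.most_common())
```

Without the alias table, the three ANN titles above would be counted under three different names; with it, they contribute to a single count.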

Conclusion
In this research article, a generic automated text mining framework is proposed to investigate research trends by analyzing the titles, keywords, and abstracts of research papers published in an identified area. The proposed framework is used to investigate research trends by analyzing the selected 1015 research papers published on SDECE in the last five decades. It is found that fuzzy logic, artificial neural networks (ANN), regression, analogy, and COCOMO are the most popular techniques, followed by use case point, function point analysis, and machine learning techniques. NASA and ISBSG are the most used datasets, while MMRE, MRE, and PRED are the most used accuracy measures. It is observed that there is a lack of research on using deep learning techniques for software effort and cost estimation. Therefore, we recommend that research scholars explore deep learning techniques for software development effort and cost estimation. The analysis is also carried out to investigate the most used techniques, datasets, and accuracy measures in each decade to understand how SDECE techniques have evolved in the last five decades. The results of the proposed framework are validated by comparing them with the outcome of previously published review work, and we have found that the results are consistent. Therefore, the proposed text mining framework is beneficial for future studies and can reduce the effort required to investigate research trends in an identified research area. To uncover research trends, we have analyzed the titles, keywords, and abstracts of the research papers separately and found that there is no significant difference in the outcome except a slight change in the rank of the most popular SDECE techniques. A detailed bibliometric analysis is also performed along with a metareview of the survey papers, which helps to determine the most relevant papers, venues, authors, and contributions of researchers in the field.
A study is recommended to uncover the research patterns and trends by analyzing numerous research papers collected from different electronic databases as this study is limited to research papers collected only from the Scopus database.
Data Availability
The data are available for the experimental study.

Conflicts of Interest
The authors have nothing to declare as conflicts of interest with respect to this manuscript.