TSHD: Topic Segmentation Based on Headings Detection (Case Study: Resumes)

Many unstructured documents contain segments with specific topics. Extracting these segments and identifying their topics helps to access the required information directly. This can improve the quality of many NLP applications such as information extraction, information retrieval, summarization, and question answering. Resumes (CVs) are unstructured documents that have diverse formats. They contain various segments such as personal information, experience, and education. Manually processing resumes to find the most suitable candidates for a particular job is a difficult task. Due to the increased amount of data, it has become necessary to process resumes by computer to save time and effort. This research presents a new algorithm named TSHD for topic segmentation based on headings detection. We apply the algorithm to extract resume segments and identify their topics. The proposed TSHD algorithm is accurate and addresses many weaknesses in previous studies. Evaluation results show a very high F1 score (about 96%) and a very low segmentation error (about 2%). The algorithm can be easily adapted to deal with other textual domains that contain headings in their segments.


Introduction
Most people create resumes that describe and highlight what they have done in order to get a suitable job. Big companies receive a large number of resumes every day, which makes choosing the right candidate for a specific job from a large number of applicants very hard. Doing this task manually often requires an expensive process, significant time, and human effort.
Some companies have created electronic forms to be filled out by the applicant [1]. But this solution requires some effort by the applicant, and the required data may be filled in incorrectly. Hence, there is still a need to develop new methods for processing resume documents and extracting their important information automatically. This can be done by applying methods and techniques from the fields of natural language processing (NLP), text mining, and machine learning.
Resumes are unstructured textual documents [1] that follow different templates and formats. They include segments such as Personal Information, Experience, Education, Skills, and others. The number and order of segments vary from one document to another.
The process of extracting segments with specific topics from the text is called segmentation or topic segmentation [2].
Extracting segments from unstructured resumes efficiently and structuring them in an appropriate way is an important challenge. It has a great impact on improving the performance and accuracy of information extraction from resumes. In addition to allowing direct access to the required information from specific segments and reducing the search time, it plays an important role in improving the quality of many resume-processing applications. Topic segmentation of resumes also facilitates the exploration and analysis of resume information and the comparison of necessary information between different resumes.
Topic segmentation improves the accuracy of extracting the required information because the search is confined to specific segments instead of the entire document [2]. For example, to extract the university from which the resume owner graduated, searching within the entire resume for a university may return the university where the owner works or a university journal in which the owner published research. Thus, searching within a specific segment with a known topic contributes to obtaining the required information with greater accuracy.
The research questions for this study are as follows: (i) How can segment headings be exploited to extract document segments efficiently? (ii) Can segment headings be used to identify the segment topic? (iii) Can we structure the extracted segments with their topics in a suitable format?
This research aims to present a new algorithm for extracting and structuring document segments and identifying their topics with high efficiency. The algorithm has been applied in the resume domain. Additionally, many problems in previous studies have been addressed by the proposed algorithm.
The main contribution of this paper is a new NLP algorithm named TSHD that extracts document segments and identifies their topics based on headings detection. The proposed algorithm is not affected by differences in document templates, segment order, and font style. It can also be used to improve the quality of various NLP applications, such as information extraction, based on the extracted segments.
The rest of this paper is organized as follows: Section 2 gives an overview of related work. Section 3 presents the proposed algorithm. Evaluation results and discussions are given in Section 4. Finally, Section 5 provides conclusions and future work.

Related Work
Many methods for topic segmentation have been developed over the past years. Some methods deal with domain-independent documents, and others deal with domain-specific documents such as news articles, Wikipedia pages, novels, and resumes.
The TextTiling algorithm introduced by Hearst in [3] makes use of lexical frequency distributions to determine the similarity and correlation between adjacent text blocks using the vector space model and cosine similarity. The C99 algorithm introduced by Choi in [4] uses a ranking scheme and the cosine similarity measure in formulating a similarity matrix of all sentence pairs. Then, it determines the location of topic boundaries by clustering. The U00 algorithm introduced by Utiyama and Isahara in [5] is a statistical method that finds the maximum-probability segmentation of a given text using dynamic programming. This method is more accurate than the C99 algorithm.
Pethe et al. in [6] presented a method for text segmentation in novels. They used a hybrid approach combining neural inference and regular expression-based rule matching to recognize chapter title headers in books and achieved an F1 score of 0.77 on this task. They also presented cut-based and neural methods for chapter segmentation and achieved a low F1 score of 0.453.
Many studies proposed methods for topic identification, either to identify the topics included in documents or to identify the topics of predefined segments. Topic modeling is one of the most popular approaches to topic identification and deals with domain-independent documents. Several models have been developed for topic identification using topic modeling, such as latent Dirichlet allocation (LDA) [7] and structured topic model (STM) [8]. TopicTiling, introduced by Riedl and Biemann in [9], is a modification of the TextTiling algorithm that makes use of LDA for topic modeling.
In the resume processing domain, studies pursue several objectives, such as recruitment systems [10,11], information extraction [12], resume classification [13], and resume ranking [14]. Most studies extracted information from the whole document, which negatively affected the accuracy of results because the required information could not be accessed directly. For this reason, recent studies are more interested in topic segmentation to improve the accuracy of applications such as those mentioned above.
In the following, we review studies that presented methods for resume segmentation, explain their methods, and clarify their weaknesses.
Kessler et al. in [15], in their work on an automatic recruitment system, indicated that text segmentation is a real issue, so they chose to use an existing tool called wvWare to split the paragraphs of the MS Word document. Then, they applied a corrective process to assign a correct label to each segment using Support Vector Machines (SVM). The main disadvantage of this system is the time cost resulting from the corrective process.
Sanyal et al. in [16] applied a simple segmentation stage. They use a database or a data dictionary to hold the keywords or headings they find common in most resumes. When a new resume is taken, a parser searches for the keywords and extracts all the data between them.
Yu et al. in [17] applied machine learning techniques to extract information from resumes in two stages. In the first stage, a resume is segmented into consecutive blocks attached with labels indicating the information types (general information segments) using Hidden Markov Models (HMM). Then, in the second stage, detailed information is extracted from the general information segments using Support Vector Machine (SVM). They evaluated the cascaded hybrid model with 1,200 Chinese resumes. In general information extraction, HMM achieved precision = 75.95% and recall = 75.89%, and SVM achieved precision = 80.95% and recall = 72.87%. This research supposes that using HMM to extract segments depends on the occurrence of general information in a fixed order. This hypothesis is not appropriate because the segment order often differs from one resume to another, and errors then propagate to the second stage.
Reza and Zaman [18] proposed a way to extract information from resumes by converting them from PDF to HTML using a Python library called urllib, then reverse engineering the HTML files to HTML code using a library called Beautiful Soup. HTML code carries font information such as font size and font style. From this information, they try to detect segments. But this is not always appropriate because the font in resumes does not follow a specific standard. Therefore, some already existing resume headings were used by the segmentation process to enhance segmentation results. The weaknesses of this research are that the domain was restricted to the resumes of engineering students only and that the dataset was relatively small (only 50 resumes were used to train and test the system). Their system achieved an accuracy of 80%-85%. In addition, resumes with varied layout designs are not handled by their method.
Naive Bayes, C4.5, Random Forest, stacking, and conditional random field (CRF) were tested in [19] to classify resume lines into one of two classes: topic boundary and nontopic boundary. The corpus was partitioned into three datasets: a training set (259 CVs), a development set (65 CVs), and a test set (109 CVs). CRF achieved the best results compared to the other classifiers (precision = 80%, recall = 50%, and F1 score = 62%). But they do not address the variation of heading labels of the same segment between several resumes. Thus, similar segments cannot be matched between resumes, which prevents direct access to information from segments with known topics.
Gunaseelan et al. in [20] proposed a way to extract segments from resumes using supervised machine learning by training and testing several classifiers to predict whether a text line in a resume is a heading or not. After segmentation, they identify the topic only for the skills segment, based on approximate string matching (fuzzy matching). The XGBoost classifier outperformed the other classifiers used. It achieved precision = 91.4%, recall = 89.8%, and F1 score = 90.1%. They excluded resumes containing text in some formats, such as lists and tables, because these cause certain errors.
In this paper, we present a new algorithm for topic segmentation of documents with high efficiency. The algorithm has been applied in the resume domain. Compared with the related works, the proposed algorithm addresses several weaknesses in previous studies: it deals with resumes of people with different specialties and is not affected by differences in resume templates, segment order, and font style; moreover, it applies topic identification to identify segment topics. The algorithm can be easily adapted to deal with other textual domains that contain headings for their segments.

TSHD Algorithm
TSHD stands for topic segmentation based on headings detection. Its main goal is to extract document segments and identify their topics based on headings detection. TSHD uses the Python NLTK package [21] to implement some NLP tasks. We start by reviewing the general architecture of the algorithm (Figure 1), followed by a detailed explanation of each of its stages. At the end of this section, an illustrative example is presented.
The proposed algorithm consists of the following three main stages:
(i) The first stage: preprocessing. Raw data (resumes) is preprocessed to prepare it for the next stage. This stage consists of a series of six sequential steps (more details in Section 3.1).
(ii) The second stage: headings detection. The locations of segment headings are determined, and similar segment heading labels are unified. This is done by two consecutive linear scans: the cue phrases scan and then the cue words scan (see Section 3.2).
(iii) The third stage: segmentation. Segmentation is the last stage of the algorithm, where segment headings and their contents are extracted and structured as JSON pairs.

Preprocessing.
At this stage, raw data is preprocessed and refined to prepare it for the next stage. The steps of this stage are shown in Figure 2.

Data Transformation.
It is the process of converting text documents (resumes) from their various formats to plain text so that they can be processed by computer. For example, converting docx to txt using the Python package "docxpy" [22].

Lines Tokenization.
It is the process of dividing the text into a set of lines and storing them in a list. At this stage, we will treat the document as a list of lines.

Data Cleaning.
It is the process of removing useless and misleading data, such as punctuation, bulleted and numbered lists, multiple spaces, and blank lines.

Normalization.
It is the process of converting all letters of the text into a uniform case (lowercase), in order to standardize the differences between letter cases.

Lines Enumeration.
In this step, an enumerate (denoted as E) is created from the list of lines resulting from the previous steps. It records pairs of (index, line) for all resume lines.

Lines Refinement.
It is the process of removing nonheading lines from the enumerate E to create a refined enumerate of potential headings while keeping line indexes as they are in E. This aims to eliminate long lines from E because they are less likely to be potential headings.
Resumes include many long lines within their segment content, which may contain segment-heading keywords. Consider the following section of a resume:
With over 6 years of experience in application developing for international companies.
Experience
Employment at Maryland University
Teaching Assistant in the Department of Computer Science from 1/1/2017
The keyword "experience" is an ordinary word in the first line, but it also indicates the heading of the experience segment. This word causes a problem if the first line is taken as the heading of the experience segment. To solve the problem, all lines with a word count higher than a predefined threshold, denoted in this paper as T (maximum length of potential headings), are removed from E. The most appropriate value of T must be chosen by experimentation and testing (discussed in Section 4.3). Segmentation results can be improved somewhat by applying more than one scan to the resume and increasing the threshold value in each scan, but this increases the computational and time costs.
The output of the preprocessing stage is a refined enumerate, which contains a set of potential headings with their indexes after it has been cleaned, normalized, and refined. It is then passed as input to the next stage (headings detection).
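The preprocessing steps can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `preprocess`, the regular expressions used for cleaning, and the sample text are assumptions made for illustration, and data transformation (e.g., docx to txt) is assumed to have been done already.

```python
import re

def preprocess(raw_text, threshold=3):
    """Return the refined enumerate: (index, line) pairs of potential headings."""
    cleaned = []
    for line in raw_text.splitlines():                    # lines tokenization
        # Data cleaning (assumed rules): drop a leading bullet or list number,
        # then replace punctuation with spaces.
        line = re.sub(r"^\s*(?:[-*\u2022]|\d+[.)])\s+", "", line)
        line = re.sub(r"[^\w\s+#/-]", " ", line)
        line = re.sub(r"\s+", " ", line).strip()          # collapse multiple spaces
        if line:                                          # blank lines are removed
            cleaned.append(line.lower())                  # normalization
    enumerated = list(enumerate(cleaned))                 # lines enumeration (E)
    # Lines refinement: keep lines of at most `threshold` words, indexes unchanged.
    return [(i, text) for i, text in enumerated if len(text.split()) <= threshold]

sample = """John Smith
Software Engineer
With over 6 years of experience in application developing for international companies.

Experience
Employment at Maryland University
Teaching Assistant in the Department of Computer Science from 1/1/2017
"""
print(preprocess(sample, threshold=3))
```

With T = 3, the long summary line that mentions "experience" is dropped, while the one-word heading "experience" keeps its original index 3.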

Headings Detection.
At this stage, segment headings and their indexes are detected based on cue words/cue phrases. Then, similar heading labels are unified in order to identify the common segment topics between different resumes. Researchers in [23] show that some words and phrases indicate changes in the topic and can be used as indicators for segmenting text into a set of segments, each with one topic. They are called cue words/cue phrases. In this paper, a cue word refers to a keyword that is likely to appear in segment headings. In the same way, a cue phrase refers to a phrase that is likely to appear in segment headings. As shown in Figure 3, the headings detection stage is divided into two consecutive linear scans.
(i) First scan (cue phrases scan): segment headings are detected based on whether they contain a cue phrase.
(ii) Second scan (cue words scan): segment headings are detected based on whether they contain a cue word.
The cue words scan alone is not enough because some segment headings do not contain cue words that can be used to identify these segments, for example, personal data and professional background. In addition, there are common and frequently used cue phrases, for example, work experience and educational qualification.
Furthermore, two consecutive scans are applied, with the cue phrases scan first, rather than searching for cue words/cue phrases in one scan, because the results of the cue phrases scan are more reliable. Therefore, we made its results unmodifiable by the second scan. The cue words scan then follows to identify the remaining segment headings.

Headings Detection Main Tasks.
Headings detection involves several tasks applied during the first and second scans for cue word/cue phrase detection.

(i) Word Tokenization
The process of splitting text into tokens and storing them in a list. At this step, each line is tokenized into a list of words using the "word tokenizer" provided by the NLTK package.

(ii) Stemming
In order to standardize the differences that may appear in some headings, "PorterStemmer" provided by NLTK is used to extract the roots of words, e.g., the word "skills" is reduced to its root "skill", and the words "education" and "educational" are reduced to their root "educ".

(iii) N-gram tagging
Within the first scan (cue phrases scan), an n-gram with n = 2, called a 2-gram or bigram, is applied to generate all pairs of adjacent words in order to match and identify segment headings that contain a cue phrase.

(iv) Headings unification
Heading labels of the same segment may differ from one resume to another, e.g., Education, Academia, Academic Qualifications, Academic Background, Educational Qualification, Educational Background, and Academic Credentials. All these labels refer to one segment heading, which is Education. The lack of standardization of these labels makes it impossible to match similar segments between multiple resumes, and hence to extract information from specific segments rather than searching within the entire document. In this task, topic identification of segments is done by unifying similar segment headings. Headings representing segments about the same topic are grouped together in "unified heading" groups (personal_info, experience, education, skills, certifications, languages, awards, interest, summary, goal, military_service, additional_info, and others). For the previous example, all labels are assigned to the unified heading "education." The unified heading and its location are added to a Headings Dictionary when any of its synonyms is detected during scanning. The Headings Dictionary consists of {key : value} pairs and stores the unified headings and their locations {unified_heading: line_index}.
Also, headings that rarely appear or have few synonyms, for which unifying synonyms is not useful, are referred to as "others," such as references, projects, and membership, and their root is added as a heading to the Headings Dictionary. Tables 1 and 2 show the most commonly used cue words/cue phrases, collected from various sources. Their synonyms are grouped and unified. The roots of words are stored for the purposes explained previously.
The algorithm can deal with other textual domains that contain headings for their segments, such as scientific research publications, reports, online articles, and memorandums. This can be done by adapting Tables 1 and 2 with cue phrases and cue words belonging to another domain.
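As a rough illustration of bigram matching and headings unification, the sketch below maps heading labels to unified headings. The cue tables are small hypothetical stand-ins for Tables 1 and 2, and prefix matching against stored roots is used in place of NLTK's PorterStemmer so that the example has no dependencies; both are assumptions of this sketch, not the paper's implementation.

```python
# Hypothetical cue tables keyed by word roots (stand-ins for Tables 1 and 2).
CUE_PHRASES = {
    ("person", "data"): "personal_info",
    ("work", "experi"): "experience",
    ("educ", "qualif"): "education",
    ("academ", "background"): "education",
}
CUE_WORDS = {
    "experi": "experience",
    "educ": "education",
    "skill": "skills",
}

def bigrams(tokens):
    """All pairs of adjacent tokens (2-grams)."""
    return list(zip(tokens, tokens[1:]))

def unify_by_phrase(line):
    """Unified heading of the first cue phrase found in the line, else None."""
    for first, second in bigrams(line.lower().split()):
        for (root1, root2), unified in CUE_PHRASES.items():
            # Assumption: a prefix test stands in for comparing stemmed words.
            if first.startswith(root1) and second.startswith(root2):
                return unified
    return None

def unify_by_word(line):
    """Unified heading of the first cue word found in the line, else None."""
    for token in line.lower().split():
        for root, unified in CUE_WORDS.items():
            if token.startswith(root):
                return unified
    return None

for label in ["Educational Qualification", "Academic Background", "Education"]:
    print(label, "->", unify_by_phrase(label) or unify_by_word(label))
```

Adapting the sketch to another domain would only require replacing the two tables, which mirrors the adaptation of Tables 1 and 2 described above.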

Dealing with Special Cases.
The following special cases are likely to appear in resumes; the way each is handled is described below.

(i) Subheadings
Segment contents may include subheadings that contain the same cue word/cue phrase found in the segment heading and that should not be identified as segment headings. To illustrate, consider the following section of a resume:

Skills
Programming Skills: Python-Java-C++
Software Skills: Linux-Windows

Notice that the cue word "skills" appears three times. The first occurrence represents the segment heading, while the next two occurrences represent subheadings within the segment content. This is resolved by recording the first occurrence of a cue word/cue phrase and making it unmodifiable if it appears again later.

(ii) Recurrence of cue words/cue phrases or their synonyms
Cue words/cue phrases that identify a segment heading, or one of their synonyms, may recur within the segment content. For example:

Experience
Employment at Maryland University
Teaching Assistant in the Department of Computer Science from 1/1/2017

Notice that the cue word "employment," which is one of the synonyms that identify the heading of the experience segment, appears within the content of the experience segment, assuming that the line containing it (a short line) is not excluded in the lines refinement process. This is resolved by taking advantage of unifying similar headings: the first occurrence of a cue word/cue phrase is recorded and made unmodifiable if any of its synonyms recurs later.

(iii) Composite segments
Some resumes may include segments with close topics, e.g., Certifications, Awards, Honors, and Achievements. Some resume owners may group them into one composite segment such as "Certifications and Awards," "Awards/Achievements," or "Honors, Certifications, and Awards." Such cue words should not be treated as synonyms of a single segment heading and unified, because they may also appear as independent segments. The following section of a resume represents a composite segment:

Certifications and Awards
International Computer Drivers License (ICDL)
ACM ICPC 2019 Gold medal

In this case, an error will occur if these cue words are identified as different segment headings, e.g., "Certifications" is identified as a segment heading and "Awards" as a new segment heading. To avoid this problem, the index of the segment heading is checked to ensure it has not already been added to the Headings Dictionary.

Headings Detection Algorithm.
The refined enumerate generated by the previous stage is passed as input to this stage. At this stage, the Headings Dictionary is passed to the first scan and then to the second scan in order to identify headings according to the linear scan order.
(1) First scan: "cue phrases scan." In this scan, the algorithm identifies the location of headings that contain cue phrases, unifies similar heading labels, and adds the unified headings with their locations to the Headings Dictionary (see Algorithm 1 for pseudocode).
(2) Second scan: "cue words scan." In this scan, the algorithm identifies the location of the remaining headings that contain cue words, unifies similar heading labels, and adds the unified headings with their locations to the Headings Dictionary (see Algorithm 2 for pseudocode).
At the end of this stage, the Headings Dictionary items are resorted in ascending order according to heading locations and then passed to the next stage "segmentation stage."
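The two scans and the special-case rules can be sketched in Python as follows. This is a hedged illustration rather than a transcription of Algorithms 1 and 2: the cue tables are hypothetical, prefix matching stands in for stemming, and the input is the refined enumerate of the illustrative resume for T = 3 words.

```python
# Hypothetical cue tables keyed by word roots (not the paper's Tables 1 and 2).
CUE_PHRASES = {("person", "data"): "personal_info", ("educ", "qualif"): "education"}
CUE_WORDS = {"experi": "experience", "employ": "experience", "educ": "education",
             "skill": "skills", "certif": "certifications", "award": "awards"}

def phrase_topics(line):
    """Yield the unified heading of every cue phrase (bigram) in the line."""
    tokens = line.split()
    for first, second in zip(tokens, tokens[1:]):
        for (root1, root2), topic in CUE_PHRASES.items():
            if first.startswith(root1) and second.startswith(root2):
                yield topic

def word_topics(line):
    """Yield the unified heading of every cue word in the line."""
    for token in line.split():
        for root, topic in CUE_WORDS.items():
            if token.startswith(root):
                yield topic

def detect_headings(refined):
    headings = {}                                   # {unified_heading: line_index}
    for scan in (phrase_topics, word_topics):       # first scan, then second scan
        for index, line in refined:
            for topic in scan(line):
                # First occurrence wins: covers subheadings and recurring synonyms.
                if topic in headings:
                    continue
                # Composite segments: one line cannot start two segments.
                if index in headings.values():
                    continue
                headings[topic] = index
    # Re-sort by location before passing to the segmentation stage.
    return dict(sorted(headings.items(), key=lambda item: item[1]))

refined = [(0, "john smith"), (1, "software engineer"), (3, "personal data"),
           (6, "experience"), (9, "educational qualification"), (12, "skills"),
           (13, "programming skills"), (14, "python-java c++"),
           (15, "software skills"), (16, "linux-windows"),
           (17, "certifications and awards")]
print(detect_headings(refined))
```

In this run, "programming skills" and "software skills" are ignored because "skills" is already recorded at line 12, and "awards" is ignored because line 17 is already recorded for "certifications".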

Segmentation.
This is the last stage of the algorithm, where segment headings and their contents are extracted based on the Headings Dictionary resulting from the previous stage and the raw lines of the resume ("line tokens", before any modifications are made to them), as shown in Figure 4.
In the following, the method of extracting segment headings and their contents from resumes is presented. Then, the implementation of this stage of the algorithm is explained.
A resume typically opens with a part that gives the owner's name and specialty or a summary of the resume. This part of the resume does not usually have a heading, as in this example:
John Smith
Software Engineer
With over 6 years of experience in application developing for international companies.
The first segment of the resume is extracted by taking the part between the beginning of the resume and the first segment heading. Then, it is added to the personal information segment.

Extraction of Segment Headings and Their Contents.
This is done by associating the unified headings in the Headings Dictionary with their contents in the line tokens, depending on the locations of these headings that were detected in the previous stage, and then extracting the parts between them. Figure 5 shows the method of extracting segment headings and their contents.
At the end of this stage, the unified segment headings and their contents are stored and structured as JSON pairs. Table 3 presents an illustrative example of a resume and shows the output of each stage of the TSHD algorithm.
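A minimal sketch of the segmentation stage is given below, assuming the Headings Dictionary maps unified headings to indexes into the raw, non-blank resume lines. The function name `segment` and the choice to leave the heading line out of the segment content are assumptions of this sketch.

```python
import json

def segment(raw_lines, headings):
    """Map each unified heading to the raw lines that follow it.

    raw_lines: resume lines before cleaning (blank lines already dropped).
    headings:  {unified_heading: line_index} from the headings detection stage.
    """
    ordered = sorted(headings.items(), key=lambda item: item[1])
    first_start = ordered[0][1] if ordered else len(raw_lines)
    segments = {}
    if first_start > 0:
        # The unheaded opening part is added to the personal information segment.
        segments["personal_info"] = list(raw_lines[:first_start])
    for position, (topic, start) in enumerate(ordered):
        is_last = position + 1 == len(ordered)
        end = len(raw_lines) if is_last else ordered[position + 1][1]
        # Assumption: the heading line itself is not repeated in the content.
        segments.setdefault(topic, []).extend(raw_lines[start + 1:end])
    return segments

raw_lines = [
    "John Smith",
    "Software Engineer",
    "With over 6 years of experience in application developing for international companies.",
    "Experience",
    "Employment at Maryland University",
    "Teaching Assistant in the Department of Computer Science from 1/1/2017",
]
print(json.dumps(segment(raw_lines, {"experience": 3}), indent=2))
```

Because the content is sliced from the raw lines, the original casing and punctuation are preserved in the JSON output even though detection ran on cleaned, lowercased text.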

Results and Discussion
This section presents an evaluation and a discussion of the TSHD algorithm results after applying it to a real set of resumes. Then it shows how effective TSHD is by comparing it with the most prominent previous studies.

Dataset Description.
In order to test the proposed algorithm, 105 resumes (containing 4733 lines after deleting the blank lines) written in English were collected from random websites. They belong to people with different specialties. These resumes are unstructured text documents and follow different templates. Their segments vary in number and order. Also, the style, size, and color of the font do not follow any standards.

Evaluation Metrics.
There are two types of metrics used in evaluating segmentation results: classification-based and segmentation-based metrics (explained in detail in the following).

Classification-Based Metrics.
Segmentation can be considered a binary classification task that classifies all resume lines into the following two classes:
(i) Topic boundary: this class represents a line separating two segments with two different topics. In this study, it is the segment heading.
(ii) Nontopic boundary: this class represents a line within the segment content.
Therefore, the most important quantities for binary classification are as follows [24]:
True positive (TP): number of segment headings that are correctly identified by the algorithm.
False negative (FN): number of segment headings that are incorrectly not identified by the algorithm.
False positive (FP): number of lines that are incorrectly identified as segment headings by the algorithm.
True negative (TN): number of lines that are not segment headings and are correctly identified as such by the algorithm.
Classification metrics are strict in their decision because, for false results, they do not consider how close the identified topic boundary is to the actual topic boundary.
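For reference, precision, recall, F1 score, and accuracy follow from these four counts by their standard definitions. The short sketch below computes them; the function name and the example counts are made up for illustration and are not taken from Table 4.

```python
def classification_scores(tp, fp, fn, tn):
    """Standard binary classification metrics from the four confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

# Hypothetical counts: 100 lines, 10 of which are true segment headings.
scores = classification_scores(tp=8, fp=2, fn=2, tn=88)
print(scores)
```

Because most resume lines are not headings, TN dominates the total, so accuracy in this example is well above the F1 score computed from the same counts.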

Segmentation-Based Metrics
(i) P_k
For the segmentation task, P_k gives the probabilistic error metric in segmentation. It indicates the probability of points (lines) being identified in the wrong segments. It tests whether or not lines are separated by segment breaks [25]. P_k values range from zero to one, where smaller values are better. The value of P_k is calculated by passing a constant-sized window of k words across the text. In each step, both ends of the window are tested against the hypothesized segments to determine whether they are actually separated by a segment break. The value of k can be chosen arbitrarily, but in general, P_k is evaluated by fixing k to be half of the average reference segment length.

(ii) WindowDiff (WD)
Researchers propose the WindowDiff metric, a simple modification to the P_k metric, to avoid several problems [26]. WindowDiff moves a fixed-sized window across the text and penalizes the algorithm whenever the number of boundaries within the window does not match the true number of boundaries for that window of text. WD values range from zero to one (smaller is better).

Table 3 (recovered portion): output of each stage of the TSHD algorithm for the illustrative resume.
Preprocessing stage output for T = 3 words: ((0, "john smith"), (1, "software engineer"), (3, "personal data"), (6, "experience"), (9, "educational qualification"), (12, "skills"), (13, "programming skills"), (14, "python-java c++"), (15, "software skills"), (16, "Linux-Windows"), (17, "certifications and awards"))
First scan output: ("Personal info": 3, "education": 9)
Second scan output: ("Personal info": 3, "experience": 6, "education": 9, "skills": 12, "certifications": 17)
Segmentation output (JSON)
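The P_k and WindowDiff metrics of this subsection can be sketched in a few lines of Python. The sketch assumes each segmentation is given as a list with one 0/1 entry per line (1 marks a segment heading) and follows one common windowing convention in which a window covers k consecutive entries; other formulations differ slightly at the window edges, so the exact values reported in the paper may be computed differently.

```python
def _window_counts(labels, k):
    """Number of boundaries inside every window of k consecutive entries."""
    if k < 1 or k > len(labels):
        raise ValueError("k must be between 1 and the number of entries")
    return [sum(labels[i:i + k]) for i in range(len(labels) - k + 1)]

def pk(reference, hypothesis, k):
    """Share of windows where only one segmentation contains a boundary."""
    ref = _window_counts(reference, k)
    hyp = _window_counts(hypothesis, k)
    errors = sum((r > 0) != (h > 0) for r, h in zip(ref, hyp))
    return errors / len(ref)

def windowdiff(reference, hypothesis, k):
    """Share of windows where the two boundary counts differ."""
    ref = _window_counts(reference, k)
    hyp = _window_counts(hypothesis, k)
    errors = sum(r != h for r, h in zip(ref, hyp))
    return errors / len(ref)

# Hypothetical example: the hypothesis adds one spurious heading at position 2.
reference = [0, 1, 0, 0, 0, 0]
hypothesis = [0, 1, 1, 0, 0, 0]
print(pk(reference, hypothesis, k=3), windowdiff(reference, hypothesis, k=3))
```

In this example WindowDiff is higher than P_k because it also penalizes windows where both segmentations contain a boundary but the counts differ, which P_k does not see.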

Evaluation Results.
Classification and segmentation metrics are applied to evaluate TSHD results. The algorithm is implemented and tested for several values of the threshold T (maximum length of lines), which is introduced in the lines refinement step (see Section 3.1.6). Tested thresholds are 1, 2, 3, 4, 5, 6, 7, and all lines without refinement. Table 4 and Figure 6 show classification metrics results to evaluate TSHD for several thresholds.
Notice that values of F1 score and accuracy are very high for thresholds 2, 3, and 4. TSHD achieves the best results when T = 3 words, with F1 score = 0.964 and accuracy = 0.992. The segmentation metrics P_k and WD are used to evaluate the algorithm for several thresholds. They measure the probabilistic segmentation error. P_k is obtained by fixing k to be half of the average reference segment length. The WD metric is calculated for the most common window widths: 2, 3, and 4. Results are shown in Table 5 and Figure 7.
Notice that all segmentation metrics achieve a very low segmentation error rate for thresholds 2, 3, and 4. TSHD achieves the best results when T = 3 words. At this threshold, P_k = 0.022, and WD values equal 0.016, 0.024, and 0.032 for window widths 2, 3, and 4, respectively.

Results Discussion.
As shown in Figure 6, the higher the threshold value, the lower the precision value, because more lines are incorrectly identified as segment headings (false positives): the longer a line is, the more likely it is to be segment content and the less likely it is to be a segment heading. Moreover, the lower the threshold value, the lower the recall value, because more segment headings are incorrectly not identified (false negatives). This is due to the failure to identify segment headings whose length is greater than the specified threshold. Recall also decreases when all lines are taken, because a segment heading is not identified if one of its synonyms has already occurred inside segment content; the algorithm records only the first occurrence of cue words/cue phrases.
As shown in Table 4, the accuracy metric is very high, close to 99%, because of the high value of true negatives (number of lines that are not segment headings and are correctly identified as such by the algorithm). As shown in Figure 7, the different values of window width in the WD metric reflect the same evaluation results, and increasing the window width does not give better results. Therefore, any value of window width can be used; the best results were obtained with window width = 2.
The lines refinement step has a significant positive effect, improving the F1 score from 74% to 96% (see Figure 6) and reducing segmentation error from 19% to 2% (see Figure 7). Consequently, all classification and segmentation metrics that have been applied show the effectiveness of the TSHD algorithm. Classification metrics show a very high F1 score (about 96%), and segmentation metrics show a very low segmentation error (about 2%).
Figure 6: Classification metrics results according to different thresholds.

Results Comparison.
In the following, the results of the most prominent studies in resume segmentation are presented. Researchers have evaluated their methods after applying them to different data sets. It is not possible to compare their results accurately because the same data set is unavailable. However, the TSHD algorithm deals with various forms of resumes, and the data set that is used to evaluate the algorithm does not follow any standards and has been randomly collected from multiple sources. Thus, we compare the evaluation results of the algorithm with the results of those studies, regardless of any potential considerations in the data sets they used.
Researchers in [17] test HMM and SVM models to extract segments from resumes. Researchers in [19] test several classifiers and get the highest results using conditional random field (CRF). Researchers in [20] apply the XGBoost classifier to predict headings. They evaluate their results using precision, recall, and F1 score. Table 6 shows their evaluation results in comparison with the results of the TSHD algorithm in this paper for T = 3 words.

Conclusions and Future Work
In this paper, we presented a new algorithm named TSHD for extracting document segments and identifying their topics based on headings detection. We have applied the algorithm to extract resume segments in order to facilitate selecting suitable candidates for a particular job post. The study focuses on improving the segmentation accuracy because of its significant impact on many NLP applications. This helps in accessing the required information from specific segments directly, instead of searching within the entire document. Also, the algorithm addresses many weaknesses in previous studies.
The proposed algorithm deals with various forms of unstructured resumes of candidates with different specializations. It extracts segments, identifies their topics, and structures them in JSON format. Then, the structured segments can be used for several purposes, including:
(i) Improving the quality of many applications that process and analyze resumes with different objectives
(ii) Saving time and human effort through direct access to the required information and the ability to compare it
(iii) Identifying the available information in resumes, for example, reviewing resumes of candidates who have publications and awards
(iv) Reformatting resumes that follow different templates and printing them using one standard template
The algorithm was evaluated using classification and segmentation metrics and tested for several values of the threshold (maximum length of lines). The evaluation results show the effectiveness of the algorithm, which achieves the best results for threshold values 2, 3, and 4. It achieves a very high F1 score (about 96%) and a very low segmentation error (about 2%). The evaluation results were compared with the most prominent similar studies, which showed the clear superiority of the TSHD algorithm. The algorithm can be easily adapted to deal with other textual domains that contain headings for their segments. This can be done by using cue words and cue phrases belonging to that domain and choosing the appropriate threshold.
In future work, it would be interesting to test our algorithm on different domains such as research publications. Furthermore, it would also be interesting to build the cue words and cue phrases tables automatically using machine learning or deep learning methods. We expect that the high-quality segmentation achieved by our algorithm can improve many NLP applications such as information extraction, information retrieval, and others. However, the proposed algorithm is not applicable to documents that do not contain headings for their segments.

Data Availability
All data included in this study are available from the first author upon reasonable request.