Towards Generating Adversarial Examples on Combined Systems of Automatic Speaker Verification and Spoofing Countermeasure

Unprotected automatic speaker verification (ASV) systems are vulnerable to a variety of spoofing attacks in which an attacker (adversary) disguises himself or herself as a specific targeted user. It is common practice to use a spoofing countermeasure (CM) to improve the security of ASV systems and prevent illegal access. However, recent studies have shown that both ASV and CM systems are vulnerable to adversarial attacks. Previous research has mainly focused on adversarial attacks on a single ASV or CM system, but in practical scenarios ASV is typically deployed in conjunction with CM. In this paper, we investigate attacking the tandem system of ASV and CM with adversarial examples. A joint objective function is designed to constrain the generation of adversarial examples, and the joint gradient of the ASV and CM systems is derived to generate them. The Fast Gradient Sign Method (FGSM) and Projected Gradient Descent (PGD) are utilized to study the vulnerability of tandem verification systems against white-box adversarial attacks. Through our attack, audio samples whose original labels are spoof or nontarget can be successfully accepted by the tandem system. Experimental results on the ASVSpoof2019 dataset show that the tandem system is vulnerable to our proposed attack.


Introduction
Automatic speaker verification (ASV) aims to extract features from given utterances so as to determine whether an utterance belongs to a specific speaker. ASV is a crucial technology for biometric identification and is broadly applied in real-world applications such as access control, military, judicial forensics, and surveillance [1]. However, unprotected ASV systems are vulnerable to a variety of spoofing attacks [2]. In spoofing attacks, the attacker usually disguises himself or herself as one of the enrolled speakers by generating spoofing speech [3,4]. The emergence of spoofing attacks has promoted research on spoofing countermeasures (CM). Whether independent of ASV or combined with it, a spoofing countermeasure has become an indispensable part of ASV deployment [5]. In recent years, following the ASVSpoof challenges, work on voice spoofing attacks and their defenses has become popular [6][7][8][9][10]. In view of various spoofing scenarios, researchers have proposed many effective antispoofing methods [11][12][13][14][15]. Both the logical access (LA) and physical access (PA) scenarios are taken into account in these works. The LA scenario involves fake audio synthesized by modern text-to-speech synthesis (TTS) and voice conversion (VC) models. The PA scenario involves replayed audio signals recorded in reverberant environments under different acoustic configurations. Several teams in ASVSpoof2019 achieved excellent performance in detecting spoofing and reinforcing the robustness of ASV systems under both LA and PA scenarios. Therefore, with the rapid development of spoofing detection, it is becoming common to deploy ASV and CM together.
Adversarial attacks pose a potential threat to all types of machine learning models [16][17][18][19], so they have attracted a lot of attention in different classification tasks. According to whether the attacker has the internal information of the ASV system (including model structure, parameters, loss function, and gradient information), adversarial attacks can be divided into white-box attacks and black-box attacks [20]. In general, white-box attacks have a higher success rate, but black-box attacks are more in line with realistic attack scenarios. In recent years, preliminary progress has been made in adversarial attacks on ASV and CM. Researchers have conducted white-box attacks [21][22][23][24][25][26][27] and black-box attacks [26][27][28][29][30] on common ASV models. In [31,32], the vulnerability of the CM system to adversarial examples is also investigated.
Although there have been various works in the field of adversarial attacks on ASV or CM, as far as we know, adversarial attack research on the tandem system of ASV and CM has not yet appeared [2]. As ASV is usually utilized in combination with CM in real scenarios, it is necessary to study adversarial attacks on this kind of tandem system. In the tandem system, the ASV and CM systems are trained independently and combined during the validation phase. In order to measure the performance of the tandem system, the tandem detection cost function (t-DCF) is proposed in [33,34]. The calculation of t-DCF utilizes the different kinds of errors generated by the two subsystems and assigns different costs to these errors. In this paper, our goal is to enable utterances that should have been rejected by the tandem system to be accepted after an adversarial attack. Because there are two independent subsystems in the tandem system, the gradient and loss of both subsystems must be considered when generating adversarial examples so that an adversarial utterance can deceive both the ASV and CM systems.
In this paper, we implement the tandem verification system of ASV and CM on the ASVSpoof2019 Challenge dataset and propose a method for attacking it. To the best of our knowledge, this is the first work to study adversarial attacks on the tandem system of ASV and CM. Our contributions are as follows: (1) For the tandem system of ASV and CM, a parallel branch structure is designed to derive the joint target function. (2) The joint adversarial gradient derived from the joint target function is utilized to generate adversarial examples. (3) Compared with the step-by-step attack, the proposed joint adversarial attack method is more effective.
The remaining part of the paper is organized as follows. The related works on adversarial attacks on ASV and CM are introduced in Section 2. The models of ASV and CM and their combination are introduced in Section 3. The algorithm for generating adversarial examples in the tandem system is proposed in Section 4. The settings and results of experiments are reported in Section 5. The summary and discussion are given in Section 6.

Related Works
In this section, we introduce the preliminaries of adversarial attacks on ASV and CM, respectively.

Adversarial Attacks on ASV.
The study in [26] shows that end-to-end ASV systems are vulnerable to adversarial attacks. Adversarial examples are generated by adding a perceptually indistinguishable structured noise to the original test examples. This is the first work in the field of ASV adversarial examples. The Fast Gradient Sign Method (FGSM) is utilized to carry out white-box and black-box attacks in a cross-corpora and cross-feature setting. Another recent study [22] investigates the vulnerability of the Gaussian Mixture Model (GMM) i-vector-based ASV under adversarial attacks. The transferability of adversarial examples from one ASV to another is also evaluated in that work. "FakeBob", presented in [30], investigates the impact of threats generated by practical black-box attacks.
This study considers different cases for practical scenarios, including various ASV architectures of commercial systems, transferability of attacks, practicality of over-the-air attacks through replay, and imperceptibility based on human perception. Further studies have also explored real-time, practical, and robust adversarial attacks, in which the estimated room impulse response (RIR) is integrated into the adversarial example training process [25,28].

Adversarial Attacks on CM.
Unlike adversarial attacks on ASV systems, which have been widely explored, adversarial attacks on spoofing countermeasures have received little attention. A recent work [32] investigates the vulnerability of spoofing countermeasures for ASV under both white-box and black-box attacks with the FGSM and the Projected Gradient Descent (PGD) methods. The performance of black-box attacks across spoofing countermeasure models with different network architectures and different numbers of model parameters is compared in that work. It reveals that spoofing countermeasure models are vulnerable to FGSM and PGD attacks in the white-box scenario. The black-box attacks are also shown to be effective. In addition to the work in [32], the work in [31] has also proposed a black-box attack utilizing the transferability of adversarial examples.

Two Subsystems and Their Combination
In this section, the ASV and CM subsystems are introduced, along with the method of combining them.

The Tandem System as the Attack Victim.
Both ASV and CM systems are binary classification systems [34]. Each ASV trial is an enrollment-test pair $(u_e, u_t)$, where $u_e$ is collected at the enrollment phase and $u_t$ at the verification phase. If the speakers of $u_e$ and $u_t$ are the same, the pair is a target trial; otherwise, it is a nontarget trial. Therefore,
$$H_0^{\mathrm{asv}}\ (\text{nontarget}):\ \mathrm{id}(u_e) \neq \mathrm{id}(u_t), \qquad H_1^{\mathrm{asv}}\ (\text{target}):\ \mathrm{id}(u_e) = \mathrm{id}(u_t),$$
where $\mathrm{id}(u) \in \mathbb{N}$ represents the speaker identity corresponding to utterance $u$. ASV systems may encounter spoofed trials, so a CM system must be deployed to reject spoofed utterances. The objective of the CM system is to verify the authenticity of the test utterance $u_t$. If $u_t$ is genuine speech produced by a real human speaker, the trial is referred to as a bona fide trial. If $u_t$ is nongenuine, manipulated, or synthesized speech, the trial is referred to as a spoof trial. Therefore,
$$H_0^{\mathrm{cm}}\ (\text{spoof}):\ u_t \text{ is spoofed speech}, \qquad H_1^{\mathrm{cm}}\ (\text{bona fide}):\ u_t \text{ is genuine speech}.$$
Although both ASV and CM systems share the objective of preventing illegal access, they each have specific goals. The ASV system should be able to reject zero-effort imposters (nontarget speakers), and the CM system should be able to detect spoofing speakers. The ASV and CM systems play complementary roles, and both are needed to ensure spoofing-robust ASV.
Traditional fusion systems typically involve two subsystems with the same objective, such as the fusion of two ASV systems or two CM systems. However, ASV and CM do not have the same objective function, so the ASV-CM tandem system is different from the traditional fusion system. The trials of real target speakers should be accepted by both ASV and CM. The cascaded tandem detection framework shown in Figure 1 has shown its potential in previous work [34]. Therefore, the cascaded system shown in Figure 1 is chosen as the victim system in this paper. Thresholds must be set for both the CM and ASV modules. The final decision is obtained by comparing the scores with the two thresholds (i.e., the thresholds of the CM and ASV subsystems), and a trial is accepted only when neither score is below its threshold. For tandem systems, three different types of trials will be encountered: (i) target, (ii) nontarget, and (iii) spoof. There are two final decisions for the tandem system: (i) accept and (ii) reject. Only trials labeled target should be accepted, and both nontarget and spoof trials should be rejected.
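The decision rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the names `TandemThresholds`, `tandem_decision`, and `is_error` and the threshold values are hypothetical, and the sketch checks CM before ASV by assumption. Because a trial is accepted only when both checks pass, the outcome does not depend on that order.

```python
from dataclasses import dataclass


@dataclass
class TandemThresholds:
    cm: float   # CM threshold (hypothetical value, chosen for illustration)
    asv: float  # ASV threshold (hypothetical value, chosen for illustration)


def tandem_decision(cm_score: float, asv_score: float, thr: TandemThresholds) -> str:
    """Accept a trial only if neither score falls below its threshold."""
    if cm_score < thr.cm:
        return "reject"  # treated as spoof by the CM stage
    if asv_score < thr.asv:
        return "reject"  # treated as nontarget by the ASV stage
    return "accept"


def is_error(label: str, decision: str) -> bool:
    """Only target trials should be accepted; nontarget and spoof should be rejected."""
    should_accept = label == "target"
    return (decision == "accept") != should_accept


thr = TandemThresholds(cm=0.5, asv=0.5)
for label, cm_score, asv_score in [
    ("target", 0.9, 0.8),
    ("nontarget", 0.9, 0.2),
    ("spoof", 0.1, 0.9),
]:
    decision = tandem_decision(cm_score, asv_score, thr)
    print(label, decision, "error" if is_error(label, decision) else "ok")
```

An adversarial example succeeds against this rule only if it pushes both scores of a nontarget or spoof trial to or above their thresholds at the same time.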

The Model Details of ASV and CM Subsystems.
The deep neural network structure for ASV is presented in Figure 2, which follows the architecture utilized in [35,36]. The mutual-information maximization method is utilized for training the Siamese network. All training procedures only update the back-end of ASV (the Siamese and discriminator modules), while the front-end feature extraction module remains fixed. The green squares in Figure 2 are "Siamese" modules, and the orange squares are "discriminator" modules. The traditional cosine metric or PLDA scoring at the back-end is not utilized in this modified structure. Instead, in order to obtain the gradient, a fully connected layer is utilized to measure the similarity of speech features. If both inputs $x_{\mathrm{enrol}}$ and $x_{\mathrm{test}}$ belong to the same speaker, the target score is 1; otherwise, it is 0.
The Squeeze-Excitation Network (SENet) structure for CM is presented in Figure 3 and Table 1, which follows the SENet34 architecture proposed in [13]. The system proposed in [13] ranked 3rd and 14th in the PA and LA scenarios, respectively. SENet adaptively recalibrates the channel feature responses by explicitly modelling the dependencies between channels, which has shown great advantages in image classification tasks [37]. The use of SENet has also achieved excellent results in the field of CM.

Adversarial Attack Methods.
Given an audio sample $x$, the goal of the attack is to generate a perturbed audio signal
$$\tilde{x} = \arg\max_{\|\tilde{x} - x\|_p \le \varepsilon} L(\theta, \tilde{x}, l),$$
where $\theta$ denotes the parameters of the model, which are fixed, $\tilde{x}$ is the perturbed audio, $l$ is the original label of $x$, $L$ is the loss function, and $\varepsilon$ is the upper limit of the perturbation. The goal of the adversarial attack is to lead the classifier to misclassify $\tilde{x}$.
If the real label of audio sample $x$ is $l$, then after the adversarial attack, the classifier will identify the label of $\tilde{x}$ as $\tilde{l}$, with $\tilde{l} \neq l$. In an adversarial attack, the perturbation must be imperceptible enough that humans find it difficult to distinguish between $x$ and $\tilde{x}$. The value of $p$ is generally 2 or $\infty$, and $p = \infty$ in this work. In order to solve the above optimization problem, the Fast Gradient Sign Method (FGSM) [17] and Projected Gradient Descent (PGD) [38] are utilized in our paper.
(1) FGSM. FGSM is a single-step attack method with high computational efficiency. The main idea is to take the sign of the gradient of the loss to generate adversarial examples, since the loss increases when moving along the gradient direction. The perturbed signal generated by FGSM is as follows:
$$\tilde{x} = x + \varepsilon \cdot \mathrm{sign}\big(\nabla_x L(\theta, x, l)\big).$$
(2) PGD. PGD generates adversarial examples through iteration. The attack success rate of PGD is higher than that of FGSM, but it also consumes more computing resources. First, initialize $x_0 = x$; the audio after each iteration is then
$$x_{k+1} = \mathrm{clip}\big(x_k + \alpha \cdot \mathrm{sign}(\nabla_{x_k} L(\theta, x_k, l))\big),$$
where $0 \le k \le K$ and $K$ is the maximum number of iterations. Here $\alpha$ is the step size of the gradient update (not to be confused with the loss weight $\alpha$ of the joint loss). The function clip is utilized to clip the perturbation to satisfy $\|x_{k+1} - x\|_\infty \le \varepsilon$.

Adversarial Attack in Tandem System.
Previous work on adversarial attacks has addressed the ASV or CM subsystem separately. In this paper, adversarial attacks are conducted against the CM-ASV tandem system. We propose to use the joint gradient to generate adversarial examples for the tandem system so that the ASV and CM systems can be deceived simultaneously. In order to derive the joint objective function utilized to generate adversarial examples, a parallel decision structure is proposed, as shown in Figure 4. There are two ways to combine CM and ASV, a cascade one and a parallel one, and both follow the same decision principle as the victim system in Figure 1: the tandem system accepts the input utterance only if both the ASV and CM systems accept it. Therefore, our design can be utilized not only for parallel systems but also for cascaded systems. The CM and ASV subsystems adopt the network structures introduced in Figures 2 and 3. In order to ensure the additivity of the adversarial gradients generated by ASV and CM, the input features of both subsystems are unified to the Log Power Spectrum (LPS), which also simplifies the computation of joint gradients with respect to the features.
Figure 2: The neural network architecture for ASV. Dashed lines indicate shared parameters. All hidden layers use ReLU activation. Green boxes represent "Siamese" modules and orange boxes "discriminator" modules.
In FGSM and PGD, both the loss function and the gradient information of the target system must be obtained. The simplest method is to add perturbation to the original utterance against CM and then against ASV (or in the reverse order). However, the perturbation added later overrides the perturbation added earlier, making it impossible to deceive both systems at the same time. The analysis will be shown in Section 4. For these reasons, the joint loss function is introduced:
$$L_{\mathrm{ASV}} = \mathrm{CE}\big(l_{\mathrm{asv}}, f(x)\big),$$
$$L_{\mathrm{CM}} = \mathrm{CE}\big(l_{\mathrm{cm}}, g(x)\big),$$
$$L_{\mathrm{total}} = \alpha L_{\mathrm{ASV}} + \beta L_{\mathrm{CM}} = \alpha\,\mathrm{CE}\big(l_{\mathrm{asv}}, f(x)\big) + \beta\,\mathrm{CE}\big(l_{\mathrm{cm}}, g(x)\big),$$
where $L_{\mathrm{ASV}}$, $L_{\mathrm{CM}}$, and $L_{\mathrm{total}}$ are the loss functions of the ASV, CM, and tandem systems. CE is the cross-entropy loss function. $l_{\mathrm{asv}}$ (target = 1; nontarget = 0) and $l_{\mathrm{cm}}$ (bonafide = 1; spoof = 0) are the labels of ASV and CM. $f(x)$ and $g(x)$ are the models of ASV and CM. $\alpha$ and $\beta$ are the weights of ASV and CM, with $\alpha + \beta = 1$; in this paper, $\alpha = \beta = 0.5$. $L_{\mathrm{ASV}}$ and $L_{\mathrm{CM}}$ are obtained from the subsystems shown in Figures 2 and 3, respectively. During the generation of adversarial examples, the inputs are the original utterances, the labels for ASV and CM, and the claimed identity. Since $L_{\mathrm{total}}$ is a function of $x$, $l_{\mathrm{asv}}$, and $l_{\mathrm{cm}}$, it can be represented as $F(x, l_{\mathrm{asv}}, l_{\mathrm{cm}})$. The gradient of the joint loss function can be calculated as follows:
$$\nabla_x F(x, l_{\mathrm{asv}}, l_{\mathrm{cm}}) = \alpha \nabla_x L_{\mathrm{ASV}} + \beta \nabla_x L_{\mathrm{CM}}.$$
Substituting formulas (6)-(10) into the FGSM or PGD algorithm, the adversarial examples of the tandem system can be obtained.
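A minimal sketch of the weighted joint loss, its gradient, and the resulting PGD update is given here. It rests on loud assumptions: the ASV and CM networks are replaced by two toy logistic models that share one input feature vector (the real ASV model also takes an enrollment utterance), and every name (`ce_and_grad`, `joint_loss_and_grad`, `joint_pgd`) and value is hypothetical. The point is only that the joint gradient is the weighted sum of the two subsystem gradients, which requires both to be taken with respect to the same input features.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def ce_and_grad(x, label, w, b):
    """Cross-entropy of a toy logistic model and its gradient w.r.t. x."""
    p = np.clip(sigmoid(w @ x + b), 1e-12, 1 - 1e-12)
    loss = -(label * np.log(p) + (1 - label) * np.log(1 - p))
    return loss, (p - label) * w


def joint_loss_and_grad(x, l_asv, l_cm, asv, cm, alpha=0.5, beta=0.5):
    """Weighted sum of the ASV and CM losses and of their input gradients."""
    loss_asv, g_asv = ce_and_grad(x, l_asv, *asv)
    loss_cm, g_cm = ce_and_grad(x, l_cm, *cm)
    return alpha * loss_asv + beta * loss_cm, alpha * g_asv + beta * g_cm


def joint_pgd(x, l_asv, l_cm, asv, cm, eps, step, n_iter, alpha=0.5, beta=0.5):
    """PGD driven by the joint gradient, clipped to the eps-ball around x."""
    x_adv = x.copy()
    for _ in range(n_iter):
        _, g = joint_loss_and_grad(x_adv, l_asv, l_cm, asv, cm, alpha, beta)
        x_adv = np.clip(x_adv + step * np.sign(g), x - eps, x + eps)
    return x_adv


rng = np.random.default_rng(1)
asv = (rng.normal(size=16), 0.0)  # toy stand-in for f(x): (weights, bias)
cm = (rng.normal(size=16), 0.0)   # toy stand-in for g(x): (weights, bias)
x = rng.normal(size=16)           # shared input feature (LPS in the paper)
l_asv, l_cm = 0, 0                # a trial that both subsystems should reject

x_adv = joint_pgd(x, l_asv, l_cm, asv, cm, eps=0.5, step=0.05, n_iter=20)
print("joint loss before:", float(joint_loss_and_grad(x, l_asv, l_cm, asv, cm)[0]))
print("joint loss after: ", float(joint_loss_and_grad(x_adv, l_asv, l_cm, asv, cm)[0]))
```

Setting `alpha=1, beta=0` or `alpha=0, beta=1` reduces the update to an attack on a single subsystem, which is the building block of a step-by-step attack.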

Datasets and Metrics.
This work utilizes the ASVSpoof2019 dataset, which encompasses partitions for the assessment of LA and PA scenarios. LA implies a scenario in which a remote user seeks access to a system or service protected by ASV. An example is a telephone banking service. In this scenario, attackers may connect and then send synthetic or converted voice signals directly to the ASV system while bypassing the microphone, that is, by injecting audio into the communication channel. Attacks in the LA scenario can be generated using the latest TTS and VC technologies. The best of these algorithms produce speech that is perceptually indistinguishable from bona fide speech. In the PA scenario, spoofing attacks are presented to a fixed microphone which is placed in an environment where sounds propagate and are reflected from obstacles such as floors and walls. Implementing a replay spoofing attack requires recording bona fide speech in advance and then playing those recordings back to the microphone of the ASV system with a replay device. In this paper, we only utilize the LA partition.
The dataset provides spoofing samples generated by different spoofing methods, as well as labels of the speaker and the spoofing method [9,10]. The ASVSpoof2019 dataset is utilized to train the CM system and evaluate the experimental results. The structure of CM is shown in Figure 3. When training the CM system, LPS features are extracted according to the speaker list of ASVSpoof2019. There are 25,380 utterances in the training set for training the CM model, 24,844 utterances for development, and 71,237 for evaluation. The Blackman window function is utilized to extract LPS with a length of 1724 as features, with a window length of 0.0081 s [14]. During the training phase, each training sample consists of a feature and a {0,1} target. If the utterance comes from an imposter, the target is 0; otherwise, it is 1. The network parameters are updated with Adam through the softmax and cross-entropy loss functions, using minibatches of size 64 and a weight decay rate of 0.001. The maximum number of training iterations is 100, and training stops early when the classification accuracy on the development set has not increased for more than 5 iterations. It is worth mentioning that, in this paper, adversarial perturbations are added in the feature domain. Since attackers do not always have access to the feature input interface of models, it is necessary to attack with waveforms. During the test phase, the reconstructed adversarial waveforms are utilized to attack the tandem system. The adversarial utterances are reconstructed by combining the phase of the original spectrum with the amplitude of the adversarial spectrum, which is the standard waveform reconstruction approach when the adversarial attack algorithm is implemented in the frequency domain, as reported in [22,23,31,32,39].
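The reconstruction step (adversarial amplitude combined with the original phase) can be sketched as follows. This is a simplified illustration, not the paper's pipeline: it assumes non-overlapping rectangular frames instead of the Blackman-windowed analysis described above, and the function names, frame length, and random perturbation are hypothetical stand-ins.

```python
import numpy as np

FLOOR = 1e-12  # small constant that keeps log() finite for silent bins


def to_lps_and_phase(wave, frame_len):
    """Split into non-overlapping frames; return log power spectrum and phase."""
    n_frames = len(wave) // frame_len
    frames = wave[: n_frames * frame_len].reshape(n_frames, frame_len)
    spec = np.fft.rfft(frames, axis=1)
    lps = np.log(np.abs(spec) ** 2 + FLOOR)
    return lps, np.angle(spec)


def reconstruct_waveform(lps_adv, phase, frame_len):
    """Combine the (perturbed) amplitude with the original phase and invert."""
    power = np.maximum(np.exp(lps_adv) - FLOOR, 0.0)
    spec = np.sqrt(power) * np.exp(1j * phase)
    return np.fft.irfft(spec, n=frame_len, axis=1).reshape(-1)


rng = np.random.default_rng(0)
wave = rng.normal(size=1024)  # stand-in for an utterance
lps, phase = to_lps_and_phase(wave, frame_len=256)

# Stand-in for the perturbation an attack would add to the LPS features.
delta = 0.1 * np.sign(rng.normal(size=lps.shape))

clean_wave = reconstruct_waveform(lps, phase, frame_len=256)
adv_wave = reconstruct_waveform(lps + delta, phase, frame_len=256)
print("max round-trip error:", float(np.max(np.abs(clean_wave - wave))))
print("max adversarial change:", float(np.max(np.abs(adv_wave - wave))))
```

With overlapping windowed frames, the inverse transform would additionally need overlap-add synthesis, and the features recomputed from the reconstructed waveform would no longer match the perturbed LPS exactly.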
VoxCeleb1 is utilized to pretrain the ASV system. When training the ASV system, LPS features are extracted according to the speaker list of VoxCeleb1. The training set contains 1,211 speakers with a total of 148,624 utterances. There are also 4,874 utterances from 40 speakers to test the performance of ASV. The structure of ASV is shown in Figure 2. During the training phase, each training sample consists of two input features and a {0,1} target. If both features originate from the same speaker, the target is 1; otherwise, it is 0. The network is updated with minibatches of size 64, each containing an equal number of samples with targets 0 and 1 to avoid class imbalance in training. The network parameters are updated with Adam to minimize the cross-entropy loss between the sigmoid output of the network and the target labels. The learning rate for the 60 training iterations is 0.001, and the weight of the two-norm regularization is set to 5e-5. After training on VoxCeleb1, the ASV model is fine-tuned on the ASVSpoof2019 dataset with the CM training list (ASVSpoof2019.LA.cm.trn.txt) [40] for another 20 iterations, with the learning rate set to 0.0001. Each utterance is reduced by the TDNN to a 512-dimensional vector [41]. During fine-tuning, the TDNN modules in Figure 2 are fixed, and the green "Siamese" modules and orange "discriminator" modules are updated.
A normalized version of the tandem detection cost function (t-DCF) from [33,34] is utilized to evaluate the performance of attacks on the combined system of ASV and CM. The detection threshold of the ASV system (set to the EER operating point) is fixed, whereas the detection threshold of the CM system is allowed to vary. Results are reported in the form of minimum normalized t-DCF values. The normalized t-DCF is defined as a function of the CM threshold $\theta_{\mathrm{CM}}$, and the minimum normalized t-DCF defined in (11) is computed to evaluate the performance of the joint system:
$$t\text{-DCF}_{\mathrm{norm}}^{\min} = t\text{-DCF}_{\mathrm{norm}}\Big(\arg\min_{\theta_{\mathrm{CM}}} t\text{-DCF}_{\mathrm{norm}}(\theta_{\mathrm{CM}})\Big). \tag{11}$$
The value of (11) ranges from 0 to 1. The closer it is to 1, the more errors occur in the combined system.
When evaluating the performance of the tandem system, the same cost parameters as for the minimum normalized t-DCF in the ASVSpoof2019 Challenge are utilized. The ASV threshold is fixed at its Equal Error Rate (EER) operating point, and the CM threshold is swept to find the minimum normalized t-DCF. t-DCF is utilized only for final evaluation and not for training the ASV or CM systems. When evaluating the attack effect of adversarial examples, the False Acceptance Rate (FAR) is adopted, which is defined as the proportion of speech uttered by imposters (nontarget or spoof) that is accepted by the system. Unless otherwise noted, all experiments are tested on the combined protocols of the ASVSpoof2019 dataset (ASVSpoof2019.LA.asv.dev.gi.trl.txt and ASVSpoof2019.LA.asv.eval.gi.trl.txt).
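The FAR definition above amounts to a short computation, sketched here on a handful of made-up trials. The function name `false_acceptance_rate` and the example labels and decisions are illustrative assumptions.

```python
def false_acceptance_rate(labels, decisions):
    """Fraction of imposter trials (nontarget or spoof) that were accepted."""
    imposter = [d for lab, d in zip(labels, decisions) if lab in ("nontarget", "spoof")]
    if not imposter:
        raise ValueError("FAR is undefined without imposter trials")
    return sum(d == "accept" for d in imposter) / len(imposter)


# Made-up example: one of the four imposter trials is wrongly accepted.
labels = ["target", "nontarget", "spoof", "spoof", "nontarget"]
decisions = ["accept", "reject", "accept", "reject", "reject"]
print(false_acceptance_rate(labels, decisions))  # 0.25
```

Target trials do not enter the computation, so a successful attack shows up as a FAR that rises toward 1.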

Experiments Settings.
A tandem system of ASV and CM can be achieved by connecting the individually trained subsystems in the form of Figure 4. The tandem system is attacked by FGSM and PGD. In both the FGSM and PGD attack settings, the maximum amplitude of perturbation ε is chosen from the set {0.1, 1, 5, 10}. Since PGD is an iterative algorithm, its step size is set in relation to the maximum amplitude of perturbation.
To achieve a valid adversarial attack, in addition to having a high attack success rate, it is also important to make adversarial examples indistinguishable from the original audio to humans. An XAB listening test, a standard method for assessing the detectable differences between two choices of sensory stimuli, is conducted to evaluate the imperceptibility of adversarial audio. In the XAB test, the adversarial examples generated by the PGD algorithm with ε = 10 are utilized. Adversarial audio is generated by combining the perturbed LPS with the phase of the original utterance. Five listeners were involved in the test, each of whom was asked to listen to 50 randomly selected adversarial-original audio pairs (A and B). An utterance (X) was randomly selected from each pair, and listeners chose whether the utterance was closer to A or B.

The Performance of Systems without Attacks.
The performance of ASV and CM is evaluated separately under the ASVSpoof2019 protocol. The protocol file contains the labels of both ASV and CM. Each trial contains 4 columns: claimed speaker ID, utterance ID, CM label (bonafide/A01-A19), and ASV label (target/nontarget/spoof). Here, ASV-V represents the model trained on VoxCeleb1, and ASV-S represents the model fine-tuned on ASVSpoof2019. When the thresholds of the subsystems are all fixed at the EER point, the FARs of ASV-V, ASV-S, and CM are 9.47%, 6.21%, and 5.43%, and the FAR and t-DCF of the tandem system are 5.67% and 0.023, respectively. When testing the t-DCF of the tandem system, the ASV threshold is fixed and the CM threshold is adjusted to find the minimum t-DCF, as shown in Figure 5.
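The four-column trial format described above can be read with a small parser, sketched here. The example lines are made up for illustration (only the claimed ID LA_0073 appears in this paper), the columns are assumed to be whitespace-separated, and the names `Trial`, `parse_trial`, and `binary_labels` are hypothetical. Mapping spoof trials to an ASV label of 0 is also an assumption, since the text only defines target = 1 and nontarget = 0.

```python
from collections import Counter
from typing import NamedTuple


class Trial(NamedTuple):
    claimed_id: str
    utt_id: str
    cm_label: str   # "bonafide" or a spoofing-method tag such as "A07"
    asv_label: str  # "target", "nontarget", or "spoof"


def parse_trial(line: str) -> Trial:
    """Parse one whitespace-separated, four-column protocol line."""
    fields = line.split()
    if len(fields) != 4:
        raise ValueError(f"expected 4 columns, got {len(fields)}: {line!r}")
    return Trial(*fields)


def binary_labels(trial: Trial) -> tuple:
    """Return (l_asv, l_cm); treating spoof trials as l_asv = 0 is an assumption."""
    l_asv = 1 if trial.asv_label == "target" else 0
    l_cm = 1 if trial.cm_label == "bonafide" else 0
    return l_asv, l_cm


# Made-up protocol lines, for illustration only.
lines = [
    "LA_0073 LA_E_0000001 bonafide target",
    "LA_0073 LA_E_0000002 bonafide nontarget",
    "LA_0073 LA_E_0000003 A07 spoof",
]
trials = [parse_trial(line) for line in lines]
print(Counter(t.asv_label for t in trials))
print([binary_labels(t) for t in trials])
```

Counting the ASV labels this way is also how the label imbalance of a protocol file can be checked.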

Evaluation of White-Box Digital Attacks.
In order to display the distribution of different kinds of samples intuitively, the ASV and CM subsystems were utilized to score samples with different labels. Figure 6 shows the score histograms of the ASV and CM systems: Figure 6(a) shows the scores of original utterances, and Figure 6(b) shows the scores of adversarial examples. For the ASV system, the adversarial objective is to accept all the utterances originally labeled as nontarget. For the CM system, the adversarial objective is to accept all the utterances originally labeled as spoof. Comparing Figures 6(a) and 6(b), it can be seen that, before the adversarial attack, the scores of some utterances labeled nontarget and spoof are below the threshold of ASV, and the score distinction between bonafide and spoof examples derived from the CM system is obvious. After the attack, however, the scores of most examples are higher than the threshold of ASV, and the distinction between bonafide and spoof is no longer obvious. The performance of the ASV-S, CM, and tandem systems after the adversarial attack is shown in Table 2, where PGD-100 means that the number of PGD iterations is 100. FAR and t-DCF are utilized to evaluate the performance of adversarial attacks. It can be seen that PGD is more effective than FGSM and that the higher the upper perturbation limit is, the more effective the attack is. Figure 7 is a t-SNE diagram of audio samples with and without the adversarial attack. Samples whose claimed ID is LA_0073 have been chosen to be shown in the figure. Before the attack, the three kinds of samples are clearly distinguishable, and the boundaries of each type are relatively clear, so the tandem system can easily distinguish the three types of samples. After the attack, the classification boundaries of the adversarial examples become blurred. Therefore, the FAR and t-DCF increase.
This shows that the proposed attack algorithm on the tandem system works as intended and can make both subsystems misclassify at the same time. Therefore, an utterance whose original label is spoof or nontarget can be recognized as bonafide by CM and as target by ASV.

Evaluation of Imperceptibility.
The subjective XAB listening test in Section 4 results in an average classification accuracy of 47.2%, which is close to the 50% chance level and confirms the imperceptibility of the adversarial perturbation. Spectrograms of an example utterance are shown in Figure 8, where ε = 10. From top to bottom are the spectrograms of the original utterance, the perturbed utterance, and the perturbation. As can be seen from the figure, the difference between the original utterance and the adversarial utterance is tiny. Most areas of the spectrogram of the perturbation are very low in energy, which shows the imperceptibility of the perturbation.
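Whether 47.2% is statistically compatible with guessing can be checked with an exact binomial test, sketched here. The calculation assumes that the answers of the five listeners can be pooled into 5 × 50 = 250 independent trials, which the text does not state (it reports only the average accuracy); under that assumption, 47.2% corresponds to 118 correct answers. The function name `binom_two_sided_p` is illustrative.

```python
from math import comb


def binom_two_sided_p(k: int, n: int, p: float = 0.5) -> float:
    """Exact two-sided binomial p-value: mass of outcomes no more likely than k."""
    pmf = [comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]
    tail = sum(q for q in pmf if q <= pmf[k] * (1 + 1e-9))
    return min(1.0, tail)


n_trials = 5 * 50              # listeners x pairs, pooled (assumption)
k = round(0.472 * n_trials)    # 118 correct answers
p_value = binom_two_sided_p(k, n_trials)
print(k, n_trials, p_value)
```

A large p-value here means the listeners' accuracy cannot be distinguished from random guessing, which is the sense in which the perturbation is called imperceptible.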

The Joint Attacks versus the Step-by-Step One.
In addition, to verify that the joint attack is really effective, the performance of a step-by-step adversarial attack utilizing PGD-100 is evaluated. The step-by-step adversarial attack is divided into two situations: (1) the ASV system is attacked first, and the CM system is attacked afterwards (ASV ⟶ CM; i.e., first α = 1 and β = 0, and then α = 0 and β = 1); (2) the CM system is attacked first, and the ASV system is attacked afterwards (CM ⟶ ASV; i.e., first α = 0 and β = 1, and then α = 1 and β = 0). The results of the experiments are shown in Table 3. Because the numbers of labels {target, nontarget, spoof} in the protocols (ASVSpoof2019.LA.asv.dev.gi.trl.txt and ASVSpoof2019.LA.asv.eval.gi.trl.txt) are not balanced, the performances of the two kinds of step-by-step attacks are different. The total number of trials is 132,127, of which 6,854 are target, 39,095 are nontarget, and 86,178 are spoof.
Sample analysis in Table 4 showed that almost all trials labeled spoof were accepted in the ASV ⟶ CM attack, while almost all trials labeled nontarget were rejected. The opposite phenomenon was found in the CM ⟶ ASV attack. It can be seen that the labels of trials have a strong correlation with the results of step-by-step attacks. We believe that the reason for this phenomenon may be that the objective functions of attacking CM and ASV are not exactly the same. Since adversarial samples are produced by adding perturbation over the whole time and frequency range, the perturbation added later covers the perturbation added before. Therefore, an adversarial sample that is effective against one system is ineffective against the other. Whether the order is ASV ⟶ CM or CM ⟶ ASV, the performance of the joint attack is far better than that of the step-by-step attack.

Parameter Sensitivity.
In equations (6) to (10), α and β are introduced to adjust the weights of the loss functions of the ASV and CM systems, respectively. In order to explore the effect of changing α and β during joint adversarial attacks, a series of experiments is conducted. Since α + β = 1, adjusting one of the two parameters changes the other as well. The variation of FAR is shown in Figure 9. The FAR curves of the tandem system in Figure 9 show that when α is close to 0 (the ASV subsystem has a small weight) or 1 (the CM subsystem has a small weight), the effect of adversarial attacks is not satisfactory, whereas when α = β = 0.5, the best attack result is obtained. The experiments show that the attack on a subsystem is significantly weakened when the weight of its loss function is reduced. The performance degradation when reducing α is more gradual than when reducing β. This phenomenon may be due to the uneven distribution of trials belonging to different labels: trials labeled spoof are more numerous than those labeled nontarget. Modifying α and β shows that, when attacking a tandem system, a drop in the weight of either loss function cannot be tolerated. It is not desirable to sacrifice the attack performance on one subsystem for the performance on the other. Because both subsystems play a vital role in the tandem system, it is critical that the attack remains effective against both.

Conclusion
In this paper, an attack method for the tandem system of ASV and CM is proposed. PGD and FGSM are utilized to implement attacks on the tandem system. Through the proposed attack method, the tandem system can be attacked successfully.
The vulnerability of the tandem system to adversarial attacks is revealed. In the future, black-box attacks against tandem systems will be explored, and adversarial defense and detection methods will also be utilized to improve the robustness and security of the tandem system.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.