1 Introduction
1.1 Problem Statement
In this project, we propose the use of policy gradient methods to perform spoofing of devices in wireless networks by impersonating their wireless transmission fingerprints. Physical layer authentication relies on detecting unique imperfections in signals transmitted by radio devices in order to obtain their fingerprint and identify them. These imperfections are present within analog components of a radio device and can differentiate radio devices even if their manufacturer and make/model are identical. Radio fingerprints are usually considered hard to reproduce or replay because the replicating or replaying device suffers from its own impairments which disturb the features in the RF fingerprint.
1.2 System Model
We consider a wireless environment in which a set of $T$ transmitters, referred to collectively as $T$, is authorized to transmit to a single receiver $R$. $R$ is equipped with a pretrained neural-network-based authenticator ${D}_{R}$ that uses raw IQ samples of the received signals to make a binary authentication decision at the physical layer, indicating whether the signal under consideration is from an authorized transmitter or not. An adversarial transmitter ${T}_{A}$ wants to communicate with $R$, and it tries to do so by impersonating one of the $T$ authorized transmitters.
In a wireless communication system, there are three main sources of nonlinearities that are imparted on the intended transmitted signal: if $x(t)$ is the signal at the beginning of the transmitter chain, the signal at the end of the receiver chain will be of the form $y(t)={f}_{R}({f}_{C}({f}_{T}(x(t))))$, where ${f}_{R}$, ${f}_{C}$ and ${f}_{T}$ are the fingerprints introduced by the receiver hardware, the channel and the transmitter hardware, respectively. Physical-layer wireless authentication systems in the literature are mostly designed to differentiate transmitters based on ${f}_{T}$ (for example, in [1]), so we emulate a similar setting in our approach. We assume that we can place an adversarial receiver ${R}_{A}$ close enough to $R$ that the channel from ${T}_{A}$ to ${R}_{A}$ is similar to the channel from $T$ to $R$, so ${R}_{A}$ receives signals with an ${f}_{C}$ similar to that of the signals $R$ receives. Furthermore, as a simplification, in this project we assume that we can find an ${R}_{A}$ device with an ${f}_{R}$ similar to that of $R$, which is not unreasonable since high-quality wireless receivers of the same make/model will have a smaller variance of ${f}_{R}$ [2]. Effectively, this means that ${D}_{R}$ and the discriminator $D$ that ${R}_{A}$ builds will have learned to discriminate based on the same ${f}_{T}$.
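The composition $y={f}_{R}({f}_{C}({f}_{T}(x)))$ can be illustrated per sample with a minimal Python sketch. The three impairment functions below (a cubic amplitude compression, a single complex channel gain, and a DC offset) and all of their parameter values are hypothetical stand-ins chosen only for illustration; they are not the models used in this project.

```python
import cmath

def f_T(x, a3=0.05):
    """Hypothetical transmitter impairment: mild cubic amplitude compression."""
    return x * (1.0 - a3 * abs(x) ** 2)

def f_C(x, gain=0.8, phase=0.3):
    """Hypothetical flat channel: a single complex gain."""
    return x * cmath.rect(gain, phase)

def f_R(x, dc=0.01 + 0.01j):
    """Hypothetical receiver impairment: small DC offset."""
    return x + dc

def received(x):
    """Sample at the end of the receiver chain: y = f_R(f_C(f_T(x)))."""
    return f_R(f_C(f_T(x)))

if __name__ == "__main__":
    print(received(1 + 0j))
```

If two receivers share the same `f_C` and `f_R`, any difference between the signals they observe from two transmitters comes from `f_T` alone, which is the situation the assumptions above aim for.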
Now ${R}_{A}$ builds a discriminator $D$ that tries to distinguish between the signals it receives from $T$ and the signals it receives from ${T}_{A}$. Note that ${R}_{A}$ has access to the ground truth, since ${T}_{A}$ includes a flag in its transmitted signals. For each signal received from ${T}_{A}$, ${R}_{A}$ transmits its classification decision back to ${T}_{A}$ as feedback. ${T}_{A}$ tries to build a generator $G$ whose purpose is to distort the complex IQ samples of the input discrete-time signal $z(n)$ at ${T}_{A}$ such that, after transmission, the signal is classified as authenticated by $D$. At convergence, ${T}_{A}$ should be able to generate signals that are good enough to fool ${D}_{R}$ (at $R$) into believing that they are from $T$.
1.3 Proposed Solution
The idea is to model $D$ and $G$ as neural networks. $D$ takes the ${N}_{S}$ most recent samples of the sampled signal $y(n)$ and outputs a scalar value through a sigmoid, which can be thresholded to obtain a binary decision on whether the signal is authorized. We also plan to try $T+1$ classes, with $T$ classes corresponding to the $T$ authorized transmitters and a single class corresponding to non-authorized transmitters. Since $D$ has access to the ground truth, it is straightforward to train $D$ through gradient descent in a classical supervised learning fashion. However, this supervised method is not available for $G$, as there is no ground truth for it. We therefore propose to train $G$ using policy gradient methods.
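The input/output contract of the binary discriminator can be sketched as follows. This is a deliberately minimal stand-in: a single logistic layer over the real and imaginary parts of the window, whereas $D$ in this project is a neural network. The names `discriminator_score`, `authorize`, `weights`, `bias` and `threshold` are hypothetical.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def discriminator_score(window, weights, bias):
    """Map a window of N_S complex samples to a score in (0, 1).

    window  : list of N_S complex samples y(n), ..., most recent last
    weights : list of 2 * N_S real weights (hypothetical linear model)
    """
    features = []
    for y in window:
        features.extend([y.real, y.imag])  # split IQ into two real features
    logit = bias + sum(w * f for w, f in zip(weights, features))
    return sigmoid(logit)

def authorize(window, weights, bias, threshold=0.5):
    """Threshold the sigmoid output to get the binary decision."""
    return discriminator_score(window, weights, bias) >= threshold

if __name__ == "__main__":
    win = [0.5 + 0.5j, -0.5 + 0.5j]
    print(discriminator_score(win, [1.0, 0.0, 0.0, 1.0], 0.0))
```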
We can model the neural network $G$ as the policy $\pi $ of a Markov Decision Process.
The action ${a}_{n}\in {\mathbb{R}}^{2}$ is the output of the neural network $\pi ({s}_{n})$, where ${s}_{n}$ is the state at time $n$. The generator $G$ can then be viewed as the source of the signal $a(n)={a}_{n}[0]+j{a}_{n}[1]$. The signal $a(n)$ is transmitted to the receiver to obtain $y(n)={f}_{R}({f}_{C}({f}_{T}(a(n))))$. The reward ${r}_{n}$ at time $n$ will be the feedback from $D$ based on $\{y(n),y(n-1),\dots,y(n-{N}_{S})\}$. Depending on the main type of distortion that the impersonator tries to mimic, different definitions of the state ${s}_{n}$ can be used.

1. The state is a vector ${s}_{n}=[\text{Re}\{z(n)\},\text{Im}\{z(n)\}]$ containing the real and imaginary parts of the most recent sample of the signal $z(n)$. This is applicable when the distortion of each sample is independent of the other samples. For example, the distortion imparted by the power amplifier in the RF chain has this property [1].

2. The state is a sequence ${s}_{n}=\{[\text{Re}\{z(n)\},\text{Im}\{z(n)\}],{a}_{n-1},\dots,{a}_{n-{N}_{P}-1}\}$, where the length ${N}_{P}$ is a hyperparameter. This state is applicable when the distortion of the most recent sample depends on previously transmitted samples as well as the current sample.

3. The state ${s}_{n}$ is $\{[\text{Re}\{z(n)\},\text{Im}\{z(n)\}],{H}_{n-1}\}$, where ${H}_{n-1}$ is the hidden state of $G({s}_{n-1})$ when $G$ is modeled as a recurrent neural network. This state can, in theory, apply to any type of nonlinearity.
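The first two state definitions can be sketched in Python as below. The function names are hypothetical, and the exact number of past actions kept in definition 2 is treated here as a free parameter `n_p`, which is an assumption of this sketch. Definition 3 is omitted because the hidden state is internal to the recurrent network.

```python
def state_memoryless(z_n):
    """Definition 1: real and imaginary parts of the current sample only."""
    return [z_n.real, z_n.imag]

def state_with_history(z_n, past_actions, n_p):
    """Definition 2: current sample followed by up to n_p past actions.

    past_actions : list of [re, im] actions, oldest first
    Returns the current sample first, then a_{n-1}, a_{n-2}, ...
    """
    recent = past_actions[-n_p:] if n_p > 0 else []
    return [[z_n.real, z_n.imag]] + list(reversed(recent))

if __name__ == "__main__":
    history = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
    print(state_memoryless(1 + 2j))
    print(state_with_history(1 + 2j, history, 2))
```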
Assume we have collected a trajectory $\tau $ defined as a sequence of states, actions, and rewards, $\{{s}_{0},{a}_{0},{r}_{0},{s}_{1},{a}_{1},{r}_{1},\mathrm{\dots},{s}_{T}\}$. Now, the goal is to tune the parameters $\theta $ (weights and biases) of $G$:
$\underset{\theta}{\text{maximize}}\ {\mathbb{E}}_{\tau}[{r}_{n}\mid \theta ]$  (1)
To solve this problem, we can use a policy gradient method: we repeatedly estimate the gradient of the policy’s performance with respect to its parameters and use that estimate to update the parameters. To estimate the gradients, we will use a score function gradient estimator. With the introduction of a baseline $b(s)$ to reduce variance, an estimate $\widehat{g}$ for ${\nabla}_{\theta}{\mathbb{E}}_{\tau}[{r}_{n}\mid \theta ]$ is [3]
${\nabla}_{\theta}{\mathbb{E}}_{\tau}[{r}_{n}\mid \theta ]\approx \widehat{g}=\sum _{n=0}^{T-1}{\nabla}_{\theta}\mathrm{log}{\pi}_{\theta}({a}_{n}\mid {s}_{n})\left({r}_{n}-b({s}_{n})\right)$  (2)
Now, the policy update can be performed with stochastic gradient ascent, $\theta \leftarrow \theta +\epsilon \widehat{g}$. This process can be repeated for a number of iterations, with a trajectory collected for each iteration, until $G$ converges to a satisfactory state; the resulting algorithm is termed Vanilla Policy Gradient. This algorithm also allows $b(s)$ to be trained along with $\theta $ [3].
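To make the estimator in (2) and the ascent step concrete, the following toy sketch applies them to a one-parameter problem. Everything in it is an assumption made for illustration: the policy is a one-dimensional Gaussian with mean $\theta$ and fixed standard deviation, the reward is $-(a-\text{target})^{2}$ rather than discriminator feedback, the baseline is a constant tracked as a running mean of rewards, and the sum in (2) is divided by the trajectory length (which only rescales the step size).

```python
import random

def vanilla_policy_gradient(target=1.0, sigma=0.5, step=0.05,
                            iterations=300, horizon=64, seed=0):
    rng = random.Random(seed)
    theta = 0.0      # policy parameter: mean of the Gaussian policy
    baseline = 0.0   # constant baseline b, a running mean of rewards
    for _ in range(iterations):
        g_hat = 0.0
        reward_sum = 0.0
        for _ in range(horizon):
            a = rng.gauss(theta, sigma)        # a_n sampled from pi_theta
            r = -(a - target) ** 2             # toy reward (assumption)
            score = (a - theta) / sigma ** 2   # d/dtheta of log pi_theta(a_n)
            g_hat += score * (r - baseline)    # one term of the sum in (2)
            reward_sum += r
        theta += step * g_hat / horizon        # stochastic gradient ascent
        # Update the baseline only after it was used, so it does not
        # depend on the samples of the current iteration.
        baseline += 0.1 * (reward_sum / horizon - baseline)
    return theta

if __name__ == "__main__":
    print(vanilla_policy_gradient())
```

With these settings the returned parameter should end up close to the target, since the expected gradient of the toy objective is $-2(\theta -\text{target})$.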
In our case, ${N}_{S}$ symbols of the signal have to be processed (perturbed) by the generator before transmission, and hence a reward ${r}_{n}$ is not immediately available for every state ${s}_{n}$. To estimate ${r}_{n}$, we propose to perform a Monte Carlo search from ${s}_{n}$ until the final state, using a rollout policy ${G}_{\beta}$, which may or may not be equal to $G$ [4].
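The Monte Carlo reward estimate can be sketched as follows: a partial sequence is completed several times with the rollout policy, each completed sequence is scored by the discriminator, and the scores are averaged. The callables `rollout_policy` and `discriminator`, their signatures, and the parameter `n_rollouts` are hypothetical placeholders for ${G}_{\beta}$ and $D$.

```python
import random

def rollout_reward(prefix, n_s, rollout_policy, discriminator,
                   n_rollouts=16, rng=None):
    """Estimate the reward of a partial sequence by Monte Carlo search.

    prefix         : samples generated so far (length <= n_s)
    rollout_policy : callable (sequence, rng) -> next sample
    discriminator  : callable (full sequence) -> score in [0, 1]
    """
    rng = rng or random.Random(0)
    total = 0.0
    for _ in range(n_rollouts):
        seq = list(prefix)
        while len(seq) < n_s:                 # roll out to the final state
            seq.append(rollout_policy(seq, rng))
        total += discriminator(seq)           # score the completed sequence
    return total / n_rollouts

if __name__ == "__main__":
    policy = lambda seq, rng: rng.uniform(0.0, 1.0)
    disc = lambda seq: sum(seq) / len(seq)
    print(rollout_reward([0.0, 0.0], 4, policy, disc))
```

When the prefix already has length `n_s`, no rollout is needed and the estimate reduces to the discriminator score of the sequence itself.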
2 Experimental Evaluation
For a preliminary evaluation of our proposed approach, we made the following assumptions: 1. ${R}_{A}$ is absent, and $R$ itself provides a scalar value denoting the probability that the signal from ${T}_{A}$ is authorized. This value is used to construct the reward. 2. The channels from $T$ and ${T}_{A}$ to $R$ are perfect (there are no channel effects). 3. The transmitter fingerprint is due to the power amplifier nonlinearity, modeled by the Saleh model [1]. 4. We use state definition 3 from Sec. 1.3. 5. ${G}_{\beta}$ is initialized to $G$ and is periodically reset to the current $G$.
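For reference, the Saleh model describes a memoryless power amplifier through an amplitude (AM/AM) curve $A(r)={\alpha}_{a}r/(1+{\beta}_{a}{r}^{2})$ and a phase (AM/PM) curve $\Phi (r)={\alpha}_{\phi}{r}^{2}/(1+{\beta}_{\phi}{r}^{2})$, where $r$ is the input magnitude. A minimal per-sample sketch is given below. The default parameter values are the ones commonly quoted for Saleh's original amplifier fit and are an assumption here; the values used in this project's experiments are not stated in the text.

```python
import cmath

def saleh_pa(x, alpha_a=2.1587, beta_a=1.1517, alpha_p=4.0033, beta_p=9.1040):
    """Apply Saleh AM/AM and AM/PM distortion to one complex sample."""
    r = abs(x)
    amplitude = alpha_a * r / (1.0 + beta_a * r * r)         # AM/AM curve
    phase_shift = alpha_p * r * r / (1.0 + beta_p * r * r)   # AM/PM curve
    return cmath.rect(amplitude, cmath.phase(x) + phase_shift)

if __name__ == "__main__":
    for sample in (0.1 + 0j, 0.5 + 0.5j, 1 + 0j):
        print(sample, saleh_pa(sample))
```

In this model, two transmitters with slightly different parameter values produce slightly different distortions of the same input, which is what makes the amplifier usable as a fingerprint.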
To model the generator, we used a simple LSTM recurrent neural network that outputs the mean and covariance of a two-dimensional Gaussian distribution, which we sample to obtain the action and also use to compute the action probability (when calculating gradients). We also explored three possible alternatives for the discriminator architecture: 1. Binary discriminator (BDisc): a single sigmoid output. 2. Multiclass discriminator (MDisc): $T+1$ classes, the last one representing an outlier. 3. One-vs-All discriminator (OvA): a model with $T$ binary discriminator networks, each predicting whether the input belongs to a given transmitter. Although OvA should perform better in theory, the binary discriminator was the most stable and was used for this evaluation. However, we expect to fine-tune the architecture and hyperparameters of OvA for use in future iterations. To train BDisc, we assumed there were 5 known unauthorized transmitters, whose signals were used as negative samples. It achieved an average testing accuracy of 80% (with both authorized and impersonator signals) and an impersonator rejection accuracy of 95%.
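Sampling an action from the generator's output distribution and evaluating its log-probability can be sketched as below. The sketch assumes a diagonal covariance, so the two action components are independent and are described by a mean and a standard deviation each; the text does not say whether the covariance is diagonal or full. The function names are hypothetical.

```python
import math
import random

def sample_action(mean, std, rng):
    """Draw a 2-D action from a Gaussian with diagonal covariance."""
    return [rng.gauss(m, s) for m, s in zip(mean, std)]

def log_prob(action, mean, std):
    """Log-density of the action under the same diagonal Gaussian."""
    total = 0.0
    for a, m, s in zip(action, mean, std):
        total += (-0.5 * ((a - m) / s) ** 2
                  - math.log(s)
                  - 0.5 * math.log(2.0 * math.pi))
    return total

if __name__ == "__main__":
    rng = random.Random(0)
    mean, std = [0.3, -0.2], [0.1, 0.1]
    action = sample_action(mean, std, rng)
    print(action, log_prob(action, mean, std))
```

The log-probability returned by `log_prob` is the quantity whose gradient with respect to the network parameters enters the score function estimator.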
To begin with, the generator was initialized to behave like an autoencoder, so that it does not add any perturbations (this can be done by training with an MSE loss). When training the generator using the proposed approach, the main obstacle was stability. As shown in Fig. 3, on some occasions it converged to a state in which it fooled the discriminator with 100% accuracy (this is possible because of the perfect channel assumption). On most occasions, however, it failed to converge at all. We hope to address this issue in the following ways: 1. Gradient clipping (clipping gradients of large magnitude). 2. Limiting the action space of the generator: this is important because, in practice, each sample of a signal must be restricted to a certain region of the complex plane to remain decodable by the receiver.
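The two stabilization ideas can be sketched as follows. Both functions are hypothetical illustrations: gradient clipping is shown as rescaling by the Euclidean norm (one common variant), and the action space is limited by projecting each action onto a disc of radius `radius` around a reference point `center` in the complex plane, for example the undistorted sample. The actual region needed for decodability is not specified in the text.

```python
import math

def clip_by_norm(grad, max_norm):
    """Rescale a gradient vector so its Euclidean norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm or norm == 0.0:
        return list(grad)
    scale = max_norm / norm
    return [g * scale for g in grad]

def limit_action(action, center, radius):
    """Project a 2-D action onto a disc around `center` (a complex number)."""
    offset = complex(action[0], action[1]) - center
    if abs(offset) > radius:
        offset *= radius / abs(offset)   # pull back onto the disc boundary
    limited = center + offset
    return [limited.real, limited.imag]

if __name__ == "__main__":
    print(clip_by_norm([3.0, 4.0], 1.0))
    print(limit_action([2.0, 0.0], 1 + 0j, 0.5))
```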
The current code for this project can be found at https://github.com/samurdhilbk/siggan.
References
[1] (2019) Deep learning based transmitter identification using power amplifier nonlinearity. In 2019 International Conference on Computing, Networking and Communications (ICNC), pp. 674–680.
[2] (2020) No radio left behind: radio fingerprinting through deep learning of physical-layer hardware impairments. IEEE Transactions on Cognitive Communications and Networking 6 (1), pp. 165–178.
[3] (2016-12) Optimizing expectations: from deep reinforcement learning to stochastic computation graphs. Ph.D. Thesis, EECS Department, University of California, Berkeley.
[4] (2017) SeqGAN: sequence generative adversarial nets with policy gradient. In Thirty-First AAAI Conference on Artificial Intelligence.