1.1 Problem Statement
In this project, we propose the use of policy gradient methods to perform spoofing of devices in wireless networks by impersonating their wireless transmission fingerprints. Physical layer authentication relies on detecting unique imperfections in signals transmitted by radio devices in order to obtain their fingerprint and identify them. These imperfections are present within analog components of a radio device and can differentiate radio devices even if their manufacturer and make/model are identical. Radio fingerprints are usually considered hard to reproduce or replay because the replicating or replaying device suffers from its own impairments which disturb the features in the RF fingerprint.
1.2 System Model
We consider a wireless environment in which a set of transmitters is authorized to transmit to a single receiver. The receiver is equipped with a pre-trained neural-network-based authenticator that uses raw IQ samples of the received signals to make a binary authentication decision at the physical layer, indicating whether the signal under consideration comes from an authorized transmitter. An adversarial transmitter wants to communicate with the receiver and tries to do so by impersonating one of the authorized transmitters.
In a wireless communication system, three main sources of non-linearity are imparted on the intended transmitted signal: the signal at the end of the receiver chain is the signal at the beginning of the transmitter chain after it has passed through the fingerprints introduced by the transmitter hardware, the channel, and the receiver hardware, in that order. Physical-layer wireless authentication systems in the literature are mostly designed to differentiate transmitters based on the transmitter hardware fingerprint, so we emulate a similar setting in our approach. We assume that an adversarial receiver can be placed close enough to the authorized receiver that the channel from the authorized transmitters to the adversarial receiver is similar to their channel to the authorized receiver, so both receivers observe signals with a similar channel fingerprint. Furthermore, as a simplification, in this project we assume that the adversarial receiver can be a device whose receiver fingerprint is similar to that of the authorized receiver, which is not unreasonable since high-quality wireless receivers of the same make/model have a smaller variance in their receiver fingerprints. Effectively, the two receivers will then have learned to discriminate based on the same transmitter fingerprint.
The adversarial receiver now builds a discriminator that tries to distinguish the signals it receives from the authorized transmitters from those it receives from the adversarial transmitter. Note that the adversarial receiver has access to the ground truth, since the adversarial transmitter includes a flag in its transmitted signals. For each signal received from the adversarial transmitter, the adversarial receiver transmits its classification decision back to the adversarial transmitter as feedback. The adversarial transmitter tries to build a generator whose purpose is to distort the complex IQ samples of its input discrete-time signal such that, after transmission, the signal is classified as authenticated. At convergence, the generator should produce signals that are good enough to fool the authenticator (at the authorized receiver) into believing that they come from an authorized transmitter.
1.3 Proposed Solution
The idea is to model the discriminator and the generator as neural networks. The discriminator takes a window of the most recent samples of the sampled signal and outputs a scalar value through a sigmoid, which can be thresholded to obtain a binary decision on whether the signal is authorized. We also plan to try a multi-class output, with one class per authorized transmitter and a single class for non-authorized transmitters. Since the adversarial receiver knows the ground truth, it is straightforward to train the discriminator through gradient descent in a classical supervised fashion. However, this supervised approach is not available for the generator, as it has no ground truth, so we propose to train it using policy gradient methods.
We can model the generator neural network as the policy of a Markov decision process.
The action at each time step is the output of the generator network given the state at that time step. The generator can then be viewed as the source of the perturbed signal. This signal is transmitted to the receiver, which obtains the corresponding received signal. The reward at each time step is the feedback from the discriminator based on the received signal. Depending on the main type of distortion that the impersonator tries to mimic, different definitions of the state can be used:
1. The state is a vector containing the real and imaginary parts of the most recent sample of the signal. This is applicable when the distortion of each sample is independent of the other samples. For example, the distortion imparted by the power amplifier in the RF chain has this property.
2. The state is a fixed-length sequence of the most recent samples of the signal, where the sequence length is a hyperparameter. This state is applicable when the distortion of the most recent sample depends on previously transmitted samples as well as the current sample.
3. The state includes the hidden state of the generator, when the generator is modeled as a recurrent neural network. This state can in theory apply to any type of non-linearity.
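The three state definitions above can be sketched as plain functions. This is a minimal illustration, not the project's implementation: the function names, the zero-padding at the start of the signal, and the choice to concatenate the current sample with the hidden state in the third definition are all assumptions.

```python
def state_sample(x, t):
    """Definition 1: real and imaginary parts of the most recent sample."""
    return (x[t].real, x[t].imag)


def state_window(x, t, k):
    """Definition 2: the k most recent samples up to time t.

    Positions before the start of the signal are zero-padded (an assumption).
    """
    window = [x[i] if i >= 0 else 0j for i in range(t - k + 1, t + 1)]
    return [(s.real, s.imag) for s in window]


def state_recurrent(x, t, hidden):
    """Definition 3: current sample together with the recurrent hidden state.

    Concatenation is an assumed way of combining the two.
    """
    return (x[t].real, x[t].imag) + tuple(hidden)


signal = [1 + 2j, 3 + 4j, 5 + 6j]
print(state_sample(signal, 2))
print(state_window(signal, 1, 3))
print(state_recurrent(signal, 0, [0.5]))
```

The window length `k` plays the role of the sequence-length hyperparameter in the second definition.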
Assume we have collected a trajectory, defined as a sequence of states, actions, and rewards. The goal is now to tune the parameters (weights and biases) of the generator so as to maximize the expected total reward.
To solve this problem, we can use a policy gradient method: we repeatedly estimate the gradient of the policy’s performance with respect to its parameters and use it to update them. To estimate the gradient, we use a score function gradient estimator. With a baseline introduced to reduce variance, the estimate weights the gradient of the log-probability of each action by the return that follows it minus the baseline.
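In standard policy gradient notation, the objective and the baseline-corrected score function estimator can be written as below. All symbols are assumptions introduced for illustration and are not necessarily the notation intended here: $\pi_\theta$ is the generator policy with parameters $\theta$, $\tau = (s_0, a_0, r_0, \dots, s_{T-1}, a_{T-1}, r_{T-1})$ is a trajectory, and $b(s_t)$ is the baseline.

```latex
\begin{align}
  \theta^{*} &= \arg\max_{\theta}\;
      \mathbb{E}_{\tau \sim \pi_{\theta}}\Big[\sum_{t=0}^{T-1} r_t\Big], \\
  \hat{g} &= \sum_{t=0}^{T-1} \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t)\,
      \Big(\sum_{t'=t}^{T-1} r_{t'} - b(s_t)\Big).
\end{align}
```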
The policy update can then be performed with stochastic gradient ascent on the policy parameters. This process can be repeated for a number of iterations, with a trajectory collected at each iteration, until the generator converges to a satisfactory state; the resulting algorithm is termed Vanilla Policy Gradient. This algorithm also allows the discriminator to be trained along with the generator.
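The update loop can be illustrated on a toy problem. The sketch below is not the generator training code: it assumes a one-dimensional Gaussian policy with a learnable mean and fixed standard deviation, a hypothetical reward that penalizes the squared distance to a target action, and the batch-mean reward as the baseline.

```python
import random


def train_mean(target=1.0, sigma=0.5, lr=0.05, batch=64, iters=300, seed=0):
    """Vanilla policy gradient on a 1-D Gaussian policy N(mu, sigma^2)."""
    rng = random.Random(seed)
    mu = 0.0  # the only learnable policy parameter in this toy setting
    for _ in range(iters):
        # Collect a batch of one-step trajectories from the current policy.
        actions = [rng.gauss(mu, sigma) for _ in range(batch)]
        rewards = [-(a - target) ** 2 for a in actions]
        # Baseline: batch-mean reward (an assumed, simple choice).
        baseline = sum(rewards) / batch
        # Score function estimator: d/dmu log N(a; mu, sigma) = (a - mu) / sigma^2.
        grad = sum(
            (a - mu) / sigma ** 2 * (r - baseline)
            for a, r in zip(actions, rewards)
        ) / batch
        # Stochastic gradient ascent step.
        mu += lr * grad
    return mu


print(train_mean())
```

With these settings the learned mean settles near the target, since the expected gradient of this reward with respect to the mean points toward it.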
In our case, a block of symbols of the signal has to be processed (perturbed) by the generator before transmission, so a reward is not immediately available for every intermediate state. To estimate the return at an intermediate state, we propose to perform a Monte-Carlo search from that state until the final state, using a roll-out policy, which may or may not be equal to the generator policy.
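The Monte-Carlo search can be sketched as follows. This is a minimal illustration under assumed interfaces: `rollout_policy(seq, rng)` returns the next sample given the partial sequence, `reward_fn(seq)` scores a completed sequence, and all names are hypothetical.

```python
import random


def rollout_value(prefix, horizon, rollout_policy, reward_fn, n_rollouts, rng):
    """Estimate the return of a partial sequence by averaging completed roll-outs."""
    total = 0.0
    for _ in range(n_rollouts):
        seq = list(prefix)
        # Complete the sequence with the roll-out policy until the final state.
        while len(seq) < horizon:
            seq.append(rollout_policy(seq, rng))
        # Only the completed sequence receives a reward.
        total += reward_fn(seq)
    return total / n_rollouts


# Toy usage: a noisy roll-out policy and a reward equal to the mean sample value.
rng = random.Random(0)
noisy_policy = lambda seq, rng: rng.gauss(0.0, 1.0)
mean_reward = lambda seq: sum(seq) / len(seq)
print(rollout_value([0.5, 0.5], 8, noisy_policy, mean_reward, 100, rng))
```

Averaging over more roll-outs reduces the variance of the estimate at the cost of more queries for feedback.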
2 Experimental Evaluation
For a preliminary evaluation of our proposed approach, we made the following assumptions: 1. The adversarial receiver is absent, and the authenticator itself provides a scalar value denoting the probability that the signal from the adversarial transmitter is authorized. This value is used to construct the reward. 2. The channel between the adversarial transmitter and the receiver is perfect (there are no channel effects). 3. The transmitter fingerprint is due to the power amplifier non-linearity, modeled by the Saleh model. 4. We use state definition 3 in Sec. 1.3. 5. The roll-out policy is initialized to the generator policy and is periodically updated to match it.
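The Saleh model in assumption 3 applies an amplitude-dependent gain (AM/AM) and phase shift (AM/PM) to each sample. The sketch below uses the parameter values commonly quoted for Saleh's original fit; these are an assumption, since the values used in this evaluation are not stated, and the function name `saleh_pa` is hypothetical.

```python
import cmath

# Commonly quoted Saleh parameters (assumed here, not taken from this evaluation).
ALPHA_A, BETA_A = 2.1587, 1.1517   # AM/AM conversion
ALPHA_P, BETA_P = 4.0033, 9.1040   # AM/PM conversion


def saleh_pa(x, alpha_a=ALPHA_A, beta_a=BETA_A, alpha_p=ALPHA_P, beta_p=BETA_P):
    """Apply the memoryless Saleh amplifier model to one complex sample."""
    r = abs(x)
    amplitude = alpha_a * r / (1.0 + beta_a * r * r)          # output amplitude
    phase_shift = alpha_p * r * r / (1.0 + beta_p * r * r)    # added phase (rad)
    return cmath.rect(amplitude, cmath.phase(x) + phase_shift)


print(saleh_pa(0.5 + 0j))
```

Because the output depends only on the current sample, this distortion is memoryless, and different amplifiers can be emulated by varying the four parameters.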
To model the generator we used a simple LSTM recurrent neural network that outputs the mean and covariance of a two-dimensional Gaussian distribution, which we sample to obtain the action and also use to compute the action probability (when calculating gradients). We also explored three possible alternatives for the discriminator architecture: 1. Binary discriminator (BDisc) - with a single sigmoid output 2. Multi-class discriminator (MDisc) - one class per authorized transmitter plus a final class representing an outlier and 3. One-vs-All discriminator (OvA) - a model with one binary discriminator network per authorized transmitter, each predicting whether the input belongs to that transmitter. Although OvA should perform better in theory, the binary discriminator was the most stable and was used for this evaluation. However, we expect to fine-tune the architecture and hyper-parameters of OvA for use in future iterations. To train BDisc, we assumed there were 5 known unauthorized transmitters, whose signals were used as negative samples. It achieved an average testing accuracy of 80% (with both authorized and impersonator signals) and an impersonator rejection accuracy of 95%.
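The Gaussian action head of the generator described at the start of this section can be sketched independently of the LSTM. The sketch assumes a diagonal covariance (one standard deviation each for the real and imaginary parts), which may differ from the covariance parameterization actually used; the function names are hypothetical.

```python
import math
import random


def sample_action(mean, std, rng):
    """Draw a 2-D action (real, imaginary) from a diagonal Gaussian."""
    return tuple(rng.gauss(m, s) for m, s in zip(mean, std))


def log_prob(action, mean, std):
    """Log-density of the action under the same diagonal Gaussian."""
    lp = 0.0
    for a, m, s in zip(action, mean, std):
        lp += -0.5 * ((a - m) / s) ** 2 - math.log(s) - 0.5 * math.log(2 * math.pi)
    return lp


rng = random.Random(0)
mean, std = (0.3, -0.1), (0.05, 0.05)   # would come from the network output
action = sample_action(mean, std, rng)
print(action, log_prob(action, mean, std))
```

The log-probability is the quantity whose gradient enters the score function estimator during training.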
To begin with, the generator was initialized to act like an autoencoder, so that it does not add any perturbations (this can be done by training with an MSE loss). When training the generator using the proposed approach, the main obstacle was stability. As shown in Fig. 3, on some occasions it converged to a state in which it fooled the discriminator with 100% accuracy (this is possible because of the perfect channel assumption). However, on most occasions it failed to converge at all. We hope to address this issue in the following ways: 1. Gradient clipping (clipping gradients of large magnitude) 2. Limiting the action space of the generator - this is important because in practice, each sample of a signal must be restricted to a certain region of the complex plane to remain decodable by the receiver.
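Both stabilization measures can be sketched in a few lines. The sketch assumes clipping by the global L2 norm of the gradient and a disc of fixed radius as the allowed region of the complex plane; both are illustrative choices, and the function names are hypothetical.

```python
import math


def clip_by_norm(grad, max_norm):
    """Rescale a gradient vector so that its L2 norm does not exceed max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm or norm == 0.0:
        return list(grad)
    scale = max_norm / norm
    return [g * scale for g in grad]


def limit_sample(z, max_radius):
    """Project a complex sample onto a disc of radius max_radius around the origin."""
    r = abs(z)
    return z if r <= max_radius else z * (max_radius / r)


print(clip_by_norm([3.0, 4.0], 1.0))
print(limit_sample(3 + 4j, 1.0))
```

Clipping by norm preserves the gradient direction, and the radial projection preserves the phase of the sample while bounding its amplitude.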
The current code for this project can be found at https://github.com/samurdhilbk/siggan.
- (2019) Deep learning based transmitter identification using power amplifier nonlinearity. In 2019 International Conference on Computing, Networking and Communications (ICNC), pp. 674–680.
- (2020) No radio left behind: radio fingerprinting through deep learning of physical-layer hardware impairments. IEEE Transactions on Cognitive Communications and Networking 6 (1), pp. 165–178.
- (2016-12) Optimizing expectations: from deep reinforcement learning to stochastic computation graphs. Ph.D. Thesis, EECS Department, University of California, Berkeley.
- (2017) SeqGAN: sequence generative adversarial nets with policy gradient. In Thirty-First AAAI Conference on Artificial Intelligence.