pedroarthur · November 6, 2025 08:28
diff --git a/iat.ipynb b/iat.ipynb
 {
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# The Dataset\n",
    "\n",
    "The dataset consists of a CSV representation of network traffic. Each entry consists of the headers of a single network packet (payload is suppressed). The following fields are present:\n",
    "\n",
    " - **timestamp**: the capture time in nanoseconds since epoch\n",
    " - **src_ip**: the network address of the source of the packet (when applicable)\n",
    " - **src_port**: the source port of the packet (when applicable)\n",
    " - **dst_ip**: the network address of the destination of the packet (when applicable)\n",
    " - **dst_port**: the destination port of the packet (when applicable)\n",
    " - **protocol**: the protocol of the packet\n",
    " - **size**: the size of the packet\n",
    "\n",
    "The **src_ip** and **dst_ip** fields will contain IPv4 or IPv6 addresses, when applicable. Otherwise, they will contain a _nil_ placeholder. The **src_port** and **dst_port** fields will contain an integer _p_ where _p > 0_ when ports are applicable to the protocol, _i == 0_ otherwise.\n",
    "\n",
    "The **protocol** field will contain _DecodeFailure_ when we have failed to identify the protocol; it will contain _Fragment_ when the packet is a fragment (_i.e._ a continuation of another packet); otherwise **protocol** will contain the highest identified protocol in the stack (_i.e._ TCP, UDP, ICMPv4, ICMPv6, among others).\n",
    "\n",
    "In this dataset, two packets belong to the same conversation if they share the same _5-tuple_ consisting of **src_ip**, **src_port**, **dst_ip**, **dst_port**, and **protocol**. It is important to note that _source_ and _destination_ roles varies during a conversation."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# The Problems"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Packets Inter-arrival times\n",
    "\n",
    "The inter-arrival time (IAT) is defined as the time between two packets originated by the same host. Consider the following excerpt of the dataset:\n",
    "\n",
    "    timestamp,src_ip,src_port,dst_ip,dst_port,protocol,size\n",
    "    1533739862306962000,10.50.5.1,4343,10.50.230.15,49706,TCP,753\n",
    "    1533739862310621000,10.50.5.1,4343,10.50.230.15,49706,TCP,60\n",
    "\n",
    "The IAT of these packets is 3.66 milliseconds.\n",
    "\n",
    "Write a function that receives two packets and returns the IAT of these packets. Considering the example above, the output is expected to be something like the following:\n",
    "\n",
    "    src_ip,src_port,dst_ip,dst_port,protocol,iat\n",
    "    10.50.5.1,4343,10.50.230.15,49706,TCP,3659000\n",
    "\n",
    "The format above is given as an example of what we would want to achieve. The function should return a single value:\n",
    "\n",
    "    assert iat(current_p, previous_p) == 3659000\n",
    "\n",
    "### What we expect:\n",
    "\n",
    " - Test cases and Unit tests"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def iat(current, previous):\n",
    "    raise NotImplementedError"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## IAT histogram of a Conversation\n",
    "\n",
    "Your task is to write a function that calculates the IAT of the packets in a given conversation. The conversation will be fed as list of packets such as the presented in the previous problem statement. The function should generate a single line consisting of a histogram for that conversation, where the bins represents how many packets fall within a given IAT. For example, if we had the following dataset:\n",
    "\n",
    "    src_ip,src_port,dst_ip,dst_port,protocol,iat\n",
    "    192.168.0.1,80,192.168.0.2,35468,TCP,61\n",
    "    192.168.0.1,80,192.168.0.2,35468,TCP,34\n",
    "    192.168.0.1,80,192.168.0.2,35468,TCP,98\n",
    "    192.168.0.1,80,192.168.0.2,35468,TCP,70\n",
    "    192.168.0.1,80,192.168.0.2,35468,TCP,31\n",
    "\n",
    "The function output should be something like:\n",
    "\n",
    "    src_ip,src_port,dst_ip,dst_port,protocol,b0,b1,b2,b3\n",
    "    192.168.0.1,80,192.168.0.2,35468,TCP,0,2,2,1\n",
    "\n",
    "Where the procedure considered the existence of four bins of 25 time-units width. The number of bins and the bin width must be defined by examining the dataset. The format above is given as an example of what we want to achieve. The function should return any array-like representation of the bins:\n",
    "\n",
    "    assert conversation_iat_histogram(conversation) == [0, 2, 2, 1]\n",
    "\n",
    "**IMPORTANT**: the samples in this problem statement show entries with the IAT already calculated. Note that the requirement is to receive a list of packets such as the ones in the dataset provided.\n",
    "\n",
    "### What we expect:\n",
    "\n",
    " - A rationale for the number of bins and the bin width\n",
    " - Plots or tables that corroborate the decisions\n",
    " - A rationale about any non-trivial insights used in the code\n",
    " - Test cases and Unit tests"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def conversation_iat_histogram(conversation_packet_list):\n",
    "    raise NotImplementedError"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## IAT histograms of an Address\n",
    "\n",
    "Your task is to write a procedure that calculates the IAT histograms of all conversations of an address. The procedure must identify all conversations in the dataset and return a list of IAT histograms, one for each conversation identified. Each item in the list must conform to the output of the _IAT of a Conversation_ function. As this function in intended to be a hot-spot in the application, use the disk to enable the recalculation of the IATs in _O(n)_, where _n_ is the number of conversations for that address.\n",
    "\n",
    "### What we expect:\n",
    "\n",
    " - A preprocessing function that uses the disk to store an intermediate state that enables the calculation of the IATs in _O(n)_\n",
    " - A rationale about any non-trivial insights used in the code\n",
    " - Test cases and Unit tests"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def address_iat_histograms(address_conversations):\n",
    "    raise NotImplementedError"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Central Tendency of the IAT of an Address\n",
    "\n",
    "Your task is to choose a central tendency function and write a procedure that calculates the central tendency of the IAT of an address.\n",
    "\n",
    "### What we expect:\n",
    "\n",
    " - A rationale about the central tendency chosen\n",
    " - Plots or tables that corroborate the decisions\n",
    " - A rationale about any non-trivial insights used in the code\n",
    " - Test cases and Unit tests"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def address_expected_iat(address):\n",
    "    raise NotImplementedError"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Submission\n",
    "\n",
    "You should e-mail a _i)_ notebook with the solution and _ii)_ a HTML copy showing tables or plots. Please, make sure that the images are visible for someone viewing the notebook in another machine. Besides that, provide a `requirements.txt` file with the dependencies of the notebook if you use Python 3.x, or a `Dockerfile` that showcase the solution if you use another technology. Last, we will test the notebook using `Run > Restart Kernel and Run all cells`; make sure to test that from scratch!"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
 }
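The two worked examples in the notebook (the 3659000 ns IAT and the `[0, 2, 2, 1]` histogram) can be checked with a minimal sketch. It is not the expected solution: the `Packet` record and the `fixed_width_histogram` helper are hypothetical names introduced here, the sketch assumes both packets come from the same host, and folding values beyond the last bin into that bin is an assumption, since the notebook does not say how overflow should be handled.

```python
from collections import namedtuple

# Hypothetical packet record mirroring the CSV columns of the dataset.
Packet = namedtuple(
    "Packet", "timestamp src_ip src_port dst_ip dst_port protocol size"
)


def iat(current, previous):
    """Return the inter-arrival time in nanoseconds between two packets.

    Assumes both packets were sent by the same host and that `current`
    was captured after `previous`.
    """
    return current.timestamp - previous.timestamp


def fixed_width_histogram(iats, bin_count=4, bin_width=25):
    """Count IAT values per bin, where bin i covers [i * width, (i + 1) * width).

    Assumption: values past the last bin are counted in the last bin.
    """
    bins = [0] * bin_count
    for value in iats:
        index = min(value // bin_width, bin_count - 1)
        bins[index] += 1
    return bins


# The two packets from the notebook excerpt.
previous_p = Packet(1533739862306962000, "10.50.5.1", 4343, "10.50.230.15", 49706, "TCP", 753)
current_p = Packet(1533739862310621000, "10.50.5.1", 4343, "10.50.230.15", 49706, "TCP", 60)

print(iat(current_p, previous_p))
print(fixed_width_histogram([61, 34, 98, 70, 31]))
```

The bin count and width are fixed at the values used in the notebook's example only to reproduce it; the task itself asks for both to be chosen by examining the dataset.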
diff --git a/network-traffic.csv.gz b/network-traffic.csv.gz