The INSDC Reference Sequence Public Dataset enables access to biological reference sequences submitted to the INSDC, where each sequence is identified by its checksum. The dataset includes both the raw sequences and their associated metadata.
insdc-reference-sequences
|-- sequence
|   |-- 023e92ccde5f86f31ea0844a92dddb86
|   |-- 8238c4f8a7915991ac98d769837f9b4b91da2a2297598e50
|   |-- bf237796417701948b5f6005d72ca5a0376f3c89e95a1c4f
|   |-- c2424a8ffca9cf8f9ef46cfdd5f69efede74b44e820c178a
|   |-- dbe6100b83178f3ac561d98c2dfc41a0
|   |-- ff734bf70e13affa85a272fda6659a5f
|   |__ *
|__ metadata
    |-- json
    |   |-- 023e92ccde5f86f31ea0844a92dddb86.json
    |   |-- 8238c4f8a7915991ac98d769837f9b4b91da2a2297598e50.json
    |   |-- bf237796417701948b5f6005d72ca5a0376f3c89e95a1c4f.json
    |   |-- c2424a8ffca9cf8f9ef46cfdd5f69efede74b44e820c178a.json
    |   |-- dbe6100b83178f3ac561d98c2dfc41a0.json
    |   |-- ff734bf70e13affa85a272fda6659a5f.json
    |   |__ *.json
    |__ csv
        |-- AAIYXD01.full.csv
        |-- CABIKC01.full.csv
        |-- LVHX01.full.csv
        |__ *.full.csv
Logs of upload processing events are available from s3://PDS/metadata/csv. Each file represents a single load attempt (generally containing all sequences in an assembly), and each line is a loaded sequence. The following table outlines data columns for each sequence record.
| # | Field Name | Description | Example |
|---|---|---|---|
| 1 | trunc512 | Secure Hash Algorithm (SHA) 512-bit hex-string digest of sequence, truncated to 48 characters | b9046fc3fb417f114d7e108637c448b214d78b7a5e345c7c |
| 2 | md5 | Message Digest (MD5) hex-string digest of sequence (32 characters) | cd8d02e2d8af721bed2ba9392a96da0e |
| 3 | length | Sequence base pair length | 1470266 |
| 4 | sha512 | Secure Hash Algorithm (SHA) 512-bit hex-string digest of sequence (128 characters) | b9046fc3fb417f114d7e108637c448b214d78b7a5e345c7c1d527fd895f081d1109da900101f323d142a407ef22cbfb6c2a174eb796217d1afa7fbbe1564787a |
| 5 | trunc512_base64 | Base64 representation of trunc512 digest (32 characters) | uQRvw_tBfxFNfhCGN8RIshTXi3peNFx8 |
| 6 | insdc | INSDC Versioned Accession Number | CABIKC010000001.1 |
| 7 | ena_type | Record type | expanded_con |
| 8 | species | Human-readable species taxonomic name (i.e., Genus species) | "Saccharomyces cerevisiae" |
| 9 | biosample | BioSample Accession | SAMEA5816324 |
| 10 | taxon | NCBI Taxonomy species identifier | 4932 |
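
The relationship between the checksum columns can be sketched in a few lines of Python. This is a minimal illustration, not the dataset's own loading code: the function names `refget_digests` and `trunc512_to_base64` are hypothetical, and it assumes digests are computed over the uppercase sequence bases with no whitespace or header lines. The example values in the table are consistent with trunc512_base64 being the URL-safe Base64 encoding of the 24 raw bytes behind the 48-character trunc512 hex string.

```python
import base64
import hashlib


def trunc512_to_base64(trunc512_hex: str) -> str:
    """Convert a 48-character trunc512 hex digest to its 32-character Base64 form."""
    # URL-safe alphabet ('-' and '_'); 24 bytes encode to 32 characters, no padding.
    return base64.urlsafe_b64encode(bytes.fromhex(trunc512_hex)).decode("ascii")


def refget_digests(sequence: str) -> dict:
    """Compute the checksum columns (1 to 5) for a single sequence string."""
    # Assumption: the digest input is the uppercase ASCII sequence only.
    data = sequence.upper().encode("ascii")
    sha512_hex = hashlib.sha512(data).hexdigest()
    trunc512 = sha512_hex[:48]  # first 48 hex characters = first 24 bytes
    return {
        "trunc512": trunc512,
        "md5": hashlib.md5(data).hexdigest(),
        "length": len(data),
        "sha512": sha512_hex,
        "trunc512_base64": trunc512_to_base64(trunc512),
    }
```

Applying `trunc512_to_base64` to the trunc512 example in row 1 of the table reproduces the trunc512_base64 example in row 5.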
An example csv of loaded sequences is displayed below as a table. The table shows a subset of sequences from assembly GCA_902192315.1, a Saccharomyces cerevisiae genome assembly.
| trunc512 | md5 | length | sha512 | trunc512_base64 | insdc | ena_type | species | biosample | taxon |
|---|---|---|---|---|---|---|---|---|---|
| b9046fc3fb417f114d7e108637c448b214d78b7a5e345c7c | cd8d02e2d8af721bed2ba9392a96da0e | 1470266 | b9046fc3fb417f114d7e108637c448b214d78b7a5e345c7c1d527fd895f081d1109da900101f323d142a407ef22cbfb6c2a174eb796217d1afa7fbbe1564787a | uQRvw_tBfxFNfhCGN8RIshTXi3peNFx8 | CABIKC010000001.1 | expanded_con | "Saccharomyces cerevisiae" | SAMEA5816324 | 4932 |
| 8decdfa7b43090448ae9411a77e2105390855bd0770e0ded | fa6ea9d18d255f0586cf967071bacf8a | 1062691 | 8decdfa7b43090448ae9411a77e2105390855bd0770e0ded4ec8d19bc17e0d2c2af4c9c38c694502d061bd310547020df5ded87641450a20a6e24985bef5904c | jezfp7QwkESK6UEad-IQU5CFW9B3Dg3t | CABIKC010000002.1 | expanded_con | "Saccharomyces cerevisiae" | SAMEA5816324 | 4932 |
| e64dd23642d2f4fcd9646eaf844f8b1b66e8dc6ca7199e14 | e6d87174b53bc10a65e8b30363fe994f | 1092091 | e64dd23642d2f4fcd9646eaf844f8b1b66e8dc6ca7199e143e84b54652c583958f642604497bdfcf5491c9879da9b8d3e46b8a3f010859e0d27367ab82f7c808 | 5k3SNkLS9PzZZG6vhE-LG2bo3GynGZ4U | CABIKC010000003.1 | expanded_con | "Saccharomyces cerevisiae" | SAMEA5816324 | 4932 |
| b9cc12a05937d362b5a55dc4a38850782c034cdf682bd465 | 178f3cd414e0f23b97397f705fade52d | 912642 | b9cc12a05937d362b5a55dc4a38850782c034cdf682bd465078fc5bbb321b09c4ad3f9717c52778d08ffb03673c317e3ab18836d110f1db86965c1840ee0bb66 | ucwSoFk302K1pV3Eo4hQeCwDTN9oK9Rl | CABIKC010000004.1 | expanded_con | "Saccharomyces cerevisiae" | SAMEA5816324 | 4932 |
| ae6c673a1878afc4bf4a2df1ed9667d116e1e294015f8875 | bba885139f4796326100ff9db55b9235 | 815240 | ae6c673a1878afc4bf4a2df1ed9667d116e1e294015f8875c6db3be3b313adb9f854b41d10faf4cc796621640158eef5b9f467dd4ff107b7bd9988f5be105b16 | rmxnOhh4r8S_Si3x7ZZn0Rbh4pQBX4h1 | CABIKC010000005.1 | expanded_con | "Saccharomyces cerevisiae" | SAMEA5816324 | 4932 |
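
As a rough sketch of how such a CSV might be read programmatically, the snippet below parses rows into typed records. It assumes the file is comma-delimited with the ten columns in the order listed in the field table, and that a header row, if one is present, starts with the column name `trunc512`; neither detail is confirmed here. The names `FIELDS`, `SequenceRecord`, and `parse_records` are illustrative only.

```python
import csv
import io
from dataclasses import dataclass
from typing import List

# Column order taken from the field table; assumed to match the CSV files.
FIELDS = ["trunc512", "md5", "length", "sha512", "trunc512_base64",
          "insdc", "ena_type", "species", "biosample", "taxon"]


@dataclass
class SequenceRecord:
    trunc512: str
    md5: str
    length: int
    sha512: str
    trunc512_base64: str
    insdc: str
    ena_type: str
    species: str
    biosample: str
    taxon: int


def parse_records(csv_text: str) -> List[SequenceRecord]:
    """Parse CSV text into one SequenceRecord per loaded sequence."""
    records = []
    for row in csv.reader(io.StringIO(csv_text)):
        # Skip blank or malformed lines and a possible header row (assumption).
        if len(row) != len(FIELDS) or row[0] == "trunc512":
            continue
        values = dict(zip(FIELDS, row))
        values["length"] = int(values["length"])
        values["taxon"] = int(values["taxon"])
        records.append(SequenceRecord(**values))
    return records
```

A quick consistency check on any parsed record is that `record.sha512.startswith(record.trunc512)`, since trunc512 is a truncation of the full SHA-512 digest.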
You can download sequences and metadata with the curl command-line tool. Be sure to include the -L flag so that curl follows the redirect from the MD5 id file to the sequence content, which is stored under the TRUNC512 id file.
curl -L http://ga4gh-refget.s3-website.us-east-2.amazonaws.com/sequence/dbe6100b83178f3ac561d98c2dfc41a0
curl -L http://ga4gh-refget.s3-website.us-east-2.amazonaws.com/metadata/json/dbe6100b83178f3ac561d98c2dfc41a0.json

You can use the requests library in Python to download sequences and metadata.
import requests
url_sequence = "http://ga4gh-refget.s3-website.us-east-2.amazonaws.com/sequence/dbe6100b83178f3ac561d98c2dfc41a0"
response_sequence = requests.get(url_sequence)
print(response_sequence.content)
url_metadata = "http://ga4gh-refget.s3-website.us-east-2.amazonaws.com/metadata/json/dbe6100b83178f3ac561d98c2dfc41a0.json"
response_metadata = requests.get(url_metadata)
print(response_metadata.content)

You can use the httr library in R to download sequences and metadata.
library(httr)
url.sequence <- "http://ga4gh-refget.s3-website.us-east-2.amazonaws.com/sequence/dbe6100b83178f3ac561d98c2dfc41a0"
response.sequence <- GET(url.sequence)
content(response.sequence, "text")
url.metadata <- "http://ga4gh-refget.s3-website.us-east-2.amazonaws.com/metadata/json/dbe6100b83178f3ac561d98c2dfc41a0.json"
response.metadata <- GET(url.metadata)
content(response.metadata, "text")

You can use the HttpURLConnection class in Java to download sequences and metadata.

import java.net.HttpURLConnection;
import java.net.URL;

// Open a GET connection to the sequence endpoint
String sequence = "http://ga4gh-refget.s3-website.us-east-2.amazonaws.com/sequence/dbe6100b83178f3ac561d98c2dfc41a0";
URL urlSequence = new URL(sequence);
HttpURLConnection connectionSequence = (HttpURLConnection) urlSequence.openConnection();
connectionSequence.setRequestMethod("GET");
// Open a GET connection to the metadata endpoint
String metadata = "http://ga4gh-refget.s3-website.us-east-2.amazonaws.com/metadata/json/dbe6100b83178f3ac561d98c2dfc41a0.json";
URL urlMetadata = new URL(metadata);
HttpURLConnection connectionMetadata = (HttpURLConnection) urlMetadata.openConnection();
connectionMetadata.setRequestMethod("GET");
// The response body can then be read from each connection's getInputStream()

Given a genome assembly of interest, the CSV data can be used to get the checksums, and therefore the raw sequence, of every sequence in the assembly. For example, to locate all sequences for assembly GCA_902192315.1, we can request the CSV as follows:
curl -L http://ga4gh-refget.s3-website.us-east-2.amazonaws.com/metadata/csv/CABIKC01.full.csv

The first and second columns of the resulting CSV give us the TRUNC512 and MD5 identifiers, respectively, of all sequences in the assembly. We can use either identifier to download each sequence. Given that the first sequence has a TRUNC512 id of b9046fc3fb417f114d7e108637c448b214d78b7a5e345c7c, we can request:
curl -L http://ga4gh-refget.s3-website.us-east-2.amazonaws.com/sequence/b9046fc3fb417f114d7e108637c448b214d78b7a5e345c7c

The above process can be repeated for every sequence to collect and reconstruct the entire assembly.
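
The loop over all sequences might look like the following sketch. It uses Python's standard-library `urllib.request` rather than requests purely to avoid an extra dependency; `urlopen` follows HTTP redirects by default, much like `curl -L`. The helper names (`sequence_urls`, `fetch_text`, `fetch_assembly`) are hypothetical, and the code assumes the CSV is comma-delimited with the TRUNC512 id in the first column and that any header row begins with `trunc512`.

```python
import csv
import io
import urllib.request
from typing import List

BASE_URL = "http://ga4gh-refget.s3-website.us-east-2.amazonaws.com"


def sequence_urls(csv_text: str) -> List[str]:
    """Build one sequence download URL per CSV row from the TRUNC512 id."""
    urls = []
    for row in csv.reader(io.StringIO(csv_text)):
        # Skip blank lines and a possible header row (assumption).
        if not row or row[0] == "trunc512":
            continue
        urls.append(f"{BASE_URL}/sequence/{row[0]}")
    return urls


def fetch_text(url: str) -> str:
    """Download a URL and return its body as text, following redirects."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8")


def fetch_assembly(csv_name: str) -> List[str]:
    """Download every sequence listed in one metadata CSV, in file order."""
    csv_text = fetch_text(f"{BASE_URL}/metadata/csv/{csv_name}")
    return [fetch_text(url) for url in sequence_urls(csv_text)]
```

For the assembly above, `fetch_assembly("CABIKC01.full.csv")` would return the sequences in the order they appear in the CSV. This keeps every sequence in memory at once, so for large assemblies it may be preferable to write each sequence to disk as it arrives.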