fastsk package
Submodules
- class fastsk.old_utils.ArabicUtility(vocab=None)
Bases:
object
- read_data(data_file, vocab='inferred')
Read a file with the following format:
بالمناسبة ، اسمي هيروش إيجيما .	MSA
مش قادر نرقد كويس في الليل .	CAI
That is, a sequence of Arabic characters, a tab, and a three-letter label/city code.
- Parameters
data_file (string) – The path to the sequences.
vocab (string) –
- Returns
X (list) – list of sequences where characters have been mapped to numbers.
Y (list) – list of numerical labels (not one-hot)
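The line format described above can be parsed with a few lines of Python. The following is a minimal sketch, not the ArabicUtility implementation: the helper name `read_labeled_lines`, the plain-dict vocabulary, and the choice to number labels in order of first appearance are all assumptions made for illustration.

```python
def read_labeled_lines(lines, vocab=None, label_ids=None):
    """Parse 'sequence<TAB>label' lines into integer-encoded X and Y.

    Hypothetical helper: characters and labels are numbered in the
    order they are first seen, which may differ from fastsk's scheme.
    """
    vocab = {} if vocab is None else vocab
    label_ids = {} if label_ids is None else label_ids
    X, Y = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # Split on the last tab so the label is always the final field.
        seq, label = line.rsplit("\t", 1)
        X.append([vocab.setdefault(ch, len(vocab)) for ch in seq])
        Y.append(label_ids.setdefault(label, len(label_ids)))
    return X, Y


X, Y = read_labeled_lines(["ab\tMSA", "ba\tCAI", "a\tMSA"])
```

Passing an open file object as `lines` works the same way, since iterating a text file yields one line at a time.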
- class fastsk.old_utils.BlendedSpectrumRunner(exec_dir, data_locaton, prefix, outdir='./temp')
Bases:
object
- class fastsk.old_utils.FastaUtility(vocab=None)
Bases:
object
- read_data(data_file, vocab='inferred', regression=False)
Read a file in a FASTA-like format of alternating label lines and sequence lines. For example:
>1
>AAAGAT
>1
>AAAAAGAT
>0
>AGTC
- Parameters
data_file (string) – The path to the sequences.
vocab (string) –
- Returns
X (list) – list of sequences where characters have been mapped to numbers.
Y (list) – list of labels
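As an illustration of the alternating label/sequence layout, here is a minimal sketch of a parser. It is not the FastaUtility code: the function name `read_fasta_like` is hypothetical, and it assumes that labels are integers for classification and floats when `regression=True`, and that a leading `>` on either kind of line can simply be stripped.

```python
def read_fasta_like(lines, regression=False):
    """Parse alternating label/sequence lines into X, Y and a vocabulary.

    Hypothetical helper; the index assignment (order of first
    appearance) is an assumption, not fastsk's documented behavior.
    """
    cleaned = [ln.strip() for ln in lines if ln.strip()]
    if len(cleaned) % 2 != 0:
        raise ValueError("expected an even number of non-empty lines")
    vocab = {}
    X, Y = [], []
    # Even positions hold labels, odd positions hold sequences.
    for label_line, seq_line in zip(cleaned[0::2], cleaned[1::2]):
        label = label_line.lstrip(">")
        seq = seq_line.lstrip(">")
        Y.append(float(label) if regression else int(label))
        X.append([vocab.setdefault(ch, len(vocab)) for ch in seq])
    return X, Y, vocab


X, Y, vocab = read_fasta_like([">1", ">AAAGAT", ">0", ">AGTC"])
```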
- class fastsk.old_utils.GaKCoRunner(exec_location, data_locaton, type_, prefix, outdir='./temp')
Bases:
object
- class fastsk.old_utils.GkmRunner(exec_location, data_locaton, dataset, g, k, approx=False, alphabet=None, outdir='./temp')
Bases:
object
- alphabet
Important note: gkmSVM’s -d parameter (max_m) is not the same as our m = g - k parameter. It’s actually the upper bound of the summation shown in equation 3 in the 2014 gkmSVM paper (ghandi2014enhanced).
- g
- k
- max_m
With the exact algorithm, the summation runs from 0 to l (gkmSVM's l corresponds to our g).
- class fastsk.old_utils.Vocabulary
Bases:
object
A class for storing the vocabulary of a sequence dataset. Maps words or characters to indexes in the vocabulary.
- add(token)
Add a token to the vocabulary.
- Parameters
token – a letter (for a char-level model) or word (for a word-level model) for which to create a mapping to an integer (the idx).
- Returns
the index of the word. If it’s already present, return its index. Otherwise, add it before returning the index.
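The add-or-look-up behavior described for add(token) can be sketched with a dictionary. This is an illustrative stand-in rather than the library's Vocabulary class: the class name `VocabularySketch`, the `token2idx` attribute, and the assumption that indexes start at 0 are not taken from the fastsk source.

```python
class VocabularySketch:
    """Illustrative token-to-index mapping (not fastsk's Vocabulary)."""

    def __init__(self):
        # Assumption: indexes start at 0 and no slots are reserved.
        self.token2idx = {}

    def add(self, token):
        """Return the token's index, adding the token first if it is new."""
        if token not in self.token2idx:
            self.token2idx[token] = len(self.token2idx)
        return self.token2idx[token]

    def __len__(self):
        return len(self.token2idx)


vocab = VocabularySketch()
encoded = [vocab.add(ch) for ch in "AGTCAG"]
```

Because add returns the existing index for a known token, encoding a sequence and growing the vocabulary happen in the same pass.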
- fastsk.old_utils.fastsk_wrap(dataset, g, m, t, approx, I, delta, skip_variance, C, return_dict)
- fastsk.old_utils.gkm_wrap(g, m, t, prefix, gkm_data, gkm_exec, approx, timeout, alphabet, return_dict)
- fastsk.old_utils.time_fastsk(g, m, t, data_location, prefix, approx=False, max_iters=None, timeout=None, skip_variance=False)
Run the FastSK kernel computation. If a timeout is provided, the computation runs in a subprocess that is killed once the timeout is reached.
- fastsk.old_utils.time_gkm(g, m, t, prefix, gkm_data, gkm_exec, approx=False, timeout=None, alphabet=None)
Run the gkm-SVM2.0 kernel computation. If a timeout is provided, the computation runs in a subprocess that is killed once the timeout is reached.
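The run-in-a-subprocess-and-kill-on-timeout pattern mentioned for time_fastsk and time_gkm can be sketched with the standard library. This is a generic illustration, not the code these functions use: `run_with_timeout` is a hypothetical helper, and it launches a Python one-liner in place of a real kernel computation or the gkm-SVM executable.

```python
import subprocess
import sys
import time


def run_with_timeout(python_code, timeout=None):
    """Run a Python snippet in a child process and time it.

    Returns the elapsed wall-clock seconds, or None if the child was
    killed because it exceeded `timeout` (in seconds).
    Hypothetical helper for illustration only.
    """
    start = time.perf_counter()
    try:
        # subprocess.run kills the child and raises TimeoutExpired
        # when the timeout elapses before the child exits.
        subprocess.run([sys.executable, "-c", python_code],
                       check=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return None
    return time.perf_counter() - start


fast = run_with_timeout("pass", timeout=30)
slow = run_with_timeout("import time; time.sleep(30)", timeout=0.5)
```

Here `fast` holds a float and `slow` is None. Returning None on timeout is one possible convention; how the fastsk functions report a timed-out run is not stated in this page.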
- fastsk.old_utils.train_and_test_fastsk(dataset, g, m, t, approx, I=50, delta=0.025, skip_variance=False, C=1, timeout=None)
- fastsk.old_utils.train_and_test_gkm(g, m, t, prefix, gkm_data, gkm_exec, approx=False, timeout=None, alphabet=None)
Utilities for reading FASTA files
- class fastsk.utils.FastaUtility(vocab=None)
Bases:
object
- read_data(data_file, vocab='inferred', regression=False)
Read a file in a FASTA-like format of alternating label lines and sequence lines. For example:
>1
>AAAGAT
>1
>AAAAAGAT
>0
>AGTC
- Parameters
data_file (string) – The path to the sequences.
vocab (string) –
- Returns
X (list) – list of sequences where characters have been mapped to numbers.
Y (list) – list of labels
- class fastsk.utils.Vocabulary
Bases:
object
A class for storing the vocabulary of a sequence dataset. Maps words or characters to indexes in the vocabulary.
- add(token)
Add a token to the vocabulary.
- Parameters
token – a letter (for a char-level model) or word (for a word-level model) for which to create a mapping to an integer (the idx).
- Returns
the index of the word. If it’s already present, return its index. Otherwise, add it before returning the index.