fastsk package

Submodules

class fastsk.old_utils.ArabicUtility(vocab=None)[source]

Bases: object

read_data(data_file, vocab='inferred')[source]
Read a file with the following format:

بالمناسبة ، اسمي هيروش إيجيما .    MSA
مش قادر نرقد كويس في الليل .    CAI

That is, each line is a sequence of Arabic characters, followed by a tab and a three-letter label/city code.

Parameters
  • data_file (string) – The path to the sequences.

  • vocab (string) –

Returns

  • X (list) – list of sequences where characters have been mapped to numbers.

  • Y (list) – list of numerical labels (not one-hot)
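A minimal, illustrative usage sketch based on the signature and return values above; the file path is a placeholder and not part of the package:

from fastsk.old_utils import ArabicUtility

# Placeholder path to a tab-separated file of Arabic sentences and
# three-letter dialect/city codes, in the format described above.
data_file = "../data/arabic_dialects.tsv"

util = ArabicUtility()
X, Y = util.read_data(data_file)  # vocab defaults to 'inferred'

# X: sequences with characters mapped to integer ids
# Y: numerical (not one-hot) labels
print(len(X), len(Y))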

class fastsk.old_utils.BlendedSpectrumRunner(exec_dir, data_locaton, prefix, outdir='./temp')[source]

Bases: object

combine_train_and_test()[source]
compute_kernel(k1=3, k2=5, mode='train_and_test')[source]
evaluate_clf()[source]
read_kernel()[source]
train_and_test(k1=3, k2=5, C=1)[source]
write_seq(datafile, mode='train')[source]
class fastsk.old_utils.DslUtility(vocab=None)[source]

Bases: object

read_data(data_file, vocab='inferred')[source]
class fastsk.old_utils.FastaUtility(vocab=None)[source]

Bases: object

read_data(data_file, vocab='inferred', regression=False)[source]

Read a file with the FASTA-like format of alternating label lines followed by sequences. For example:

>1
AAAGAT
>1
AAAAAGAT
>0
AGTC

Parameters
  • data_file (string) – The path to the sequences.

  • vocab (string) –

Returns

  • X (list) – list of sequences where characters have been mapped to numbers.

  • Y (list) – list of labels

shortest_seq(data_file)[source]
class fastsk.old_utils.FastskRegressor(dataset, data_location='../data')[source]

Bases: object

compute_train_kernel(g, m, t=20, approx=True, I=100, delta=0.025, skip_variance=False)[source]
train_and_test(g, m, t, approx, I=100, delta=0.025, skip_variance=False)[source]
class fastsk.old_utils.FastskRunner(prefix, data_location='../data')[source]

Bases: object

compute_train_kernel(g, m, t=20, approx=True, I=100, delta=0.025, skip_variance=False)[source]
evaluate_clf()[source]
train_and_test(g, m, t, approx, I=100, delta=0.025, skip_variance=False, C=1)[source]
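An illustrative usage sketch based on the signatures above; the dataset prefix and data directory are placeholders, and the return value of train_and_test is not documented here:

from fastsk.old_utils import FastskRunner

runner = FastskRunner("EP300", data_location="../data")

# Compute the train kernel with g=10, m=6, 20 threads, using the
# approximation capped at 50 iterations.
runner.compute_train_kernel(g=10, m=6, t=20, approx=True, I=50)

# Or run the full train-and-test pipeline with SVM regularization C=1.
runner.train_and_test(g=10, m=6, t=20, approx=True, I=50, C=1)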
class fastsk.old_utils.GaKCoRunner(exec_location, data_locaton, type_, prefix, outdir='./temp')[source]

Bases: object

combine_train_and_test()[source]
compute_kernel(g, m, mode='train', t=1)[source]
evaluate_clf()[source]
get_labels()[source]
read_kernel()[source]
read_labels()[source]
train_and_test(g, m, C=1)[source]
class fastsk.old_utils.GkmRunner(exec_location, data_locaton, dataset, g, k, approx=False, alphabet=None, outdir='./temp')[source]

Bases: object

classify()[source]
compute_train_kernel(t)[source]
evaluate()[source]
get_accuracy(pos_preds, neg_preds)[source]
get_auc(pos_preds, neg_preds)[source]
read_preds(file)[source]
train_and_test(t=20)[source]
train_svm()[source]
alphabet

g

k

max_m

Important note: gkmSVM’s -d parameter (max_m) is not the same as our m = g - k parameter. It is actually the upper bound of the summation shown in equation 3 of the 2014 gkmSVM paper (ghandi2014enhanced). If using the exact algorithm, the summation runs from 0 to l (their l is our g).
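To make the parameter distinction concrete, the following snippet uses arbitrary values and is purely illustrative (it is not part of the package):

g, k = 10, 6
m = g - k        # FastSK's mismatch parameter: m = g - k = 4

# gkmSVM's -d (max_m here) is instead the upper bound of the summation in
# equation 3 of the 2014 paper. With the exact algorithm, that summation
# runs from 0 to l, where their l corresponds to our g.
max_m_exact = g  # i.e. 10, not 4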

class fastsk.old_utils.Vocabulary[source]

Bases: object

A class for storing the vocabulary of a sequence dataset. Maps words or characters to indexes in the vocabulary.

add(token)[source]

Add a token to the vocabulary.

Parameters
  • token – a letter (for a char-level model) or a word (for a word-level model) for which to create a mapping to an integer index (idx).

Returns

the index of the token. If the token is already present, its existing index is returned; otherwise, it is added and its new index is returned.

size()[source]

Return the number of tokens in the vocabulary.

fastsk.old_utils.fastsk_wrap(dataset, g, m, t, approx, I, delta, skip_variance, C, return_dict)[source]
fastsk.old_utils.gkm_wrap(g, m, t, prefix, gkm_data, gkm_exec, approx, timeout, alphabet, return_dict)[source]
fastsk.old_utils.time_blended(k1, k2, prefix, timeout=None)[source]
fastsk.old_utils.time_fastsk(g, m, t, data_location, prefix, approx=False, max_iters=None, timeout=None, skip_variance=False)[source]

Run FastSK kernel computation. If a timeout is provided, the computation runs as a subprocess, which is killed when the timeout is reached.
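An illustrative call based on the signature above; the prefix and data directory are placeholders:

from fastsk.old_utils import time_fastsk

time_fastsk(
    g=10, m=6, t=20,
    data_location="../data",
    prefix="EP300",
    approx=True,
    timeout=1800,       # run as a subprocess; killed after 30 minutes
    skip_variance=True,
)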

fastsk.old_utils.time_gakco(g, m, type_, prefix, timeout=None)[source]
fastsk.old_utils.time_gkm(g, m, t, prefix, gkm_data, gkm_exec, approx=False, timeout=None, alphabet=None)[source]

Run gkm-SVM2.0 kernel computation. If a timeout is provided, the computation runs as a subprocess, which is killed when the timeout is reached.

fastsk.old_utils.train_and_test_fastsk(dataset, g, m, t, approx, I=50, delta=0.025, skip_variance=False, C=1, timeout=None)[source]
fastsk.old_utils.train_and_test_gkm(g, m, t, prefix, gkm_data, gkm_exec, approx=False, timeout=None, alphabet=None)[source]

Utilities for reading FASTA files.

class fastsk.utils.FastaUtility(vocab=None)[source]

Bases: object

read_data(data_file, vocab='inferred', regression=False)[source]

Read a file with the FASTA-like format of alternating label lines followed by sequences. For example:

>1
AAAGAT
>1
AAAAAGAT
>0
AGTC

Parameters
  • data_file (string) – The path to the sequences.

  • vocab (string) –

Returns

  • X (list) – list of sequences where characters have been mapped to numbers.

  • Y (list) – list of labels

shortest_seq(data_file)[source]
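An illustrative usage sketch based on the signatures and return values above; the file path is a placeholder:

from fastsk.utils import FastaUtility

train_file = "../data/EP300.train.fasta"  # placeholder path

reader = FastaUtility()
Xtrain, Ytrain = reader.read_data(train_file)

# Xtrain: sequences with characters mapped to integer ids
# Ytrain: the corresponding labels
print(len(Xtrain), len(Ytrain))
print(reader.shortest_seq(train_file))  # shortest sequence in the file (per the method name)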
class fastsk.utils.Vocabulary[source]

Bases: object

A class for storing the vocabulary of a sequence dataset. Maps words or characters to indexes in the vocabulary.

add(token)[source]

Add a token to the vocabulary.

Parameters
  • token – a letter (for a char-level model) or a word (for a word-level model) for which to create a mapping to an integer index (idx).

Returns

the index of the token. If the token is already present, its existing index is returned; otherwise, it is added and its new index is returned.

size()[source]

Return the number of tokens in the vocabulary.
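An illustrative usage sketch of the class; the sequence is arbitrary:

from fastsk.utils import Vocabulary

vocab = Vocabulary()

# Map each character of a sequence to an integer index; add() returns the
# existing index if the token has already been added.
indices = [vocab.add(ch) for ch in "AAAGAT"]

print(indices)
print(vocab.size())  # number of distinct tokens in the vocabulary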