fastsk package

Submodules

class fastsk.old_utils.ArabicUtility(vocab=None)[source]

Bases: object

read_data(data_file, vocab='inferred')[source]
Read a file with the following format:

بالمناسبة ، اسمي هيروش إيجيما .    MSA
مش قادر نرقد كويس في الليل .    CAI

That is, each line is a sequence of Arabic characters, followed by a tab and a three-letter label/city code.

Parameters
  • data_file (string) – The path to the sequences.

  • vocab (string) –

Returns

  • X (list) – list of sequences where characters have been mapped to numbers.

  • Y (list) – list of numerical labels (not one-hot)
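A minimal, illustrative usage sketch based on the signature and return values above; the file path is a placeholder and not part of the package:

from fastsk.old_utils import ArabicUtility

# Placeholder path to a tab-separated file of Arabic sentences and
# three-letter dialect/city codes, in the format described above.
data_file = "../data/arabic_dialects.tsv"

util = ArabicUtility()
X, Y = util.read_data(data_file)  # vocab defaults to 'inferred'

# X: sequences with characters mapped to integer ids
# Y: numerical (not one-hot) labels
print(len(X), len(Y))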

class fastsk.old_utils.BlendedSpectrumRunner(exec_dir, data_locaton, prefix, outdir='./temp')[source]

Bases: object

combine_train_and_test()[source]
compute_kernel(k1=3, k2=5, mode='train_and_test')[source]
evaluate_clf()[source]
read_kernel()[source]
train_and_test(k1=3, k2=5, C=1)[source]
write_seq(datafile, mode='train')[source]
class fastsk.old_utils.DslUtility(vocab=None)[source]

Bases: object

read_data(data_file, vocab='inferred')[source]
class fastsk.old_utils.FastaUtility(vocab=None)[source]

Bases: object

read_data(data_file, vocab='inferred', regression=False)[source]

Read a file with the FASTA-like format of alternating label lines followed by sequences. For example:

>1
AAAGAT
>1
AAAAAGAT
>0
AGTC

Parameters
  • data_file (string) – The path to the sequences.

  • vocab (string) –

Returns

  • X (list) – list of sequences where characters have been mapped to numbers.

  • Y (list) – list of labels

shortest_seq(data_file)[source]
class fastsk.old_utils.FastskRegressor(dataset, data_location='../data')[source]

Bases: object

compute_train_kernel(g, m, t=20, approx=True, I=100, delta=0.025, skip_variance=False)[source]
train_and_test(g, m, t, approx, I=100, delta=0.025, skip_variance=False)[source]
class fastsk.old_utils.FastskRunner(prefix, data_location='../data')[source]

Bases: object

compute_train_kernel(g, m, t=20, approx=True, I=100, delta=0.025, skip_variance=False)[source]
evaluate_clf()[source]
train_and_test(g, m, t, approx, I=100, delta=0.025, skip_variance=False, C=1)[source]
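An illustrative usage sketch based on the signatures above; the dataset prefix and data directory are placeholders, and the return value of train_and_test is not documented here:

from fastsk.old_utils import FastskRunner

runner = FastskRunner("EP300", data_location="../data")

# Compute the train kernel with g=10, m=6, 20 threads, using the
# approximation capped at 50 iterations.
runner.compute_train_kernel(g=10, m=6, t=20, approx=True, I=50)

# Or run the full train-and-test pipeline with SVM regularization C=1.
runner.train_and_test(g=10, m=6, t=20, approx=True, I=50, C=1)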
class fastsk.old_utils.GaKCoRunner(exec_location, data_locaton, type_, prefix, outdir='./temp')[source]

Bases: object

combine_train_and_test()[source]
compute_kernel(g, m, mode='train', t=1)[source]
evaluate_clf()[source]
get_labels()[source]
read_kernel()[source]
read_labels()[source]
train_and_test(g, m, C=1)[source]
class fastsk.old_utils.GkmRunner(exec_location, data_locaton, dataset, g, k, approx=False, alphabet=None, outdir='./temp')[source]

Bases: object

classify()[source]
compute_train_kernel(t)[source]
evaluate()[source]
get_accuracy(pos_preds, neg_preds)[source]
get_auc(pos_preds, neg_preds)[source]
read_preds(file)[source]
train_and_test(t=20)[source]
train_svm()[source]
alphabet

g

k

max_m

Important note: gkmSVM’s -d parameter (max_m) is not the same as our m = g - k parameter. It is actually the upper bound of the summation shown in equation 3 of the 2014 gkmSVM paper (ghandi2014enhanced). If using the exact algorithm, the summation runs from 0 to l (their l is our g).
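To make the parameter distinction concrete, the following snippet uses arbitrary values and is purely illustrative (it is not part of the package):

g, k = 10, 6
m = g - k        # FastSK's mismatch parameter: m = g - k = 4

# gkmSVM's -d (max_m here) is instead the upper bound of the summation in
# equation 3 of the 2014 paper. With the exact algorithm, that summation
# runs from 0 to l, where their l corresponds to our g.
max_m_exact = g  # i.e. 10, not 4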

class fastsk.old_utils.Vocabulary[source]

Bases: object

A class for storing the vocabulary of a sequence dataset. Maps words or characters to indexes in the vocabulary.

add(token)[source]

Add a token to the vocabulary.

Parameters
  • token – a letter (for a char-level model) or a word (for a word-level model) for which to create a mapping to an integer index (idx).

Returns

the index of the token. If the token is already present, its existing index is returned; otherwise, it is added and its new index is returned.

size()[source]

Return the number of tokens in the vocabulary.

fastsk.old_utils.fastsk_wrap(dataset, g, m, t, approx, I, delta, skip_variance, C, return_dict)[source]
fastsk.old_utils.gkm_wrap(g, m, t, prefix, gkm_data, gkm_exec, approx, timeout, alphabet, return_dict)[source]
fastsk.old_utils.time_blended(k1, k2, prefix, timeout=None)[source]
fastsk.old_utils.time_fastsk(g, m, t, data_location, prefix, approx=False, max_iters=None, timeout=None, skip_variance=False)[source]

Run FastSK kernel computation. If a timeout is provided, the computation runs as a subprocess, which is killed when the timeout is reached.
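An illustrative call based on the signature above; the prefix and data directory are placeholders:

from fastsk.old_utils import time_fastsk

time_fastsk(
    g=10, m=6, t=20,
    data_location="../data",
    prefix="EP300",
    approx=True,
    timeout=1800,       # run as a subprocess; killed after 30 minutes
    skip_variance=True,
)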

fastsk.old_utils.time_gakco(g, m, type_, prefix, timeout=None)[source]
fastsk.old_utils.time_gkm(g, m, t, prefix, gkm_data, gkm_exec, approx=False, timeout=None, alphabet=None)[source]

Run gkm-SVM2.0 kernel computation. If a timeout is provided, the computation runs as a subprocess, which is killed when the timeout is reached.

fastsk.old_utils.train_and_test_fastsk(dataset, g, m, t, approx, I=50, delta=0.025, skip_variance=False, C=1, timeout=None)[source]
fastsk.old_utils.train_and_test_gkm(g, m, t, prefix, gkm_data, gkm_exec, approx=False, timeout=None, alphabet=None)[source]

Utilities for reading FASTA files.

class fastsk.utils.FastaUtility(vocab=None)[source]

Bases: object

read_data(data_file, vocab='inferred', regression=False)[source]

Read a file with the FASTA-like format of alternating label lines followed by sequences. For example:

>1
AAAGAT
>1
AAAAAGAT
>0
AGTC

Parameters
  • data_file (string) – The path to the sequences.

  • vocab (string) –

Returns

  • X (list) – list of sequences where characters have been mapped to numbers.

  • Y (list) – list of labels

shortest_seq(data_file)[source]
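An illustrative usage sketch based on the signatures and return values above; the file path is a placeholder:

from fastsk.utils import FastaUtility

train_file = "../data/EP300.train.fasta"  # placeholder path

reader = FastaUtility()
Xtrain, Ytrain = reader.read_data(train_file)

# Xtrain: sequences with characters mapped to integer ids
# Ytrain: the corresponding labels
print(len(Xtrain), len(Ytrain))
print(reader.shortest_seq(train_file))  # shortest sequence in the file (per the method name)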
class fastsk.utils.Vocabulary[source]

Bases: object

A class for storing the vocabulary of a sequence dataset. Maps words or characters to indexes in the vocabulary.

add(token)[source]

Add a token to the vocabulary.

Parameters
  • token – a letter (for a char-level model) or a word (for a word-level model) for which to create a mapping to an integer index (idx).

Returns

the index of the token. If the token is already present, its existing index is returned; otherwise, it is added and its new index is returned.

size()[source]

Return the number of tokens in the vocabulary.
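An illustrative usage sketch of the class; the sequence is arbitrary:

from fastsk.utils import Vocabulary

vocab = Vocabulary()

# Map each character of a sequence to an integer index; add() returns the
# existing index if the token has already been added.
indices = [vocab.add(ch) for ch in "AAAGAT"]

print(indices)
print(vocab.size())  # number of distinct tokens in the vocabulary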