Asked 7 months ago · Answers: 5 · Viewed 20 times

Being a new user here, my questions are not being fully answered because they are not reproducible. I read the thread on producing reproducible code, but to no avail. Specifically, I am lost on how to use the dput() function.

Could someone provide a step-by-step example of how to use dput(), using the iris data frame for example? It would be very helpful.

 Answers

38

Using the iris dataset, which is handily included in R, we can see how dput() works:

data(iris)
head(iris)

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

Now we can get the whole dataset using dput(iris). In most situations, though, providing a whole dataset for a Stack Overflow question is unnecessary, as a few rows of the relevant variables suffice as a working data example.

Two things come in handy here: the head() function outputs only the first six rows of a data frame or matrix, and indexing in R (via brackets) lets you select only specific columns.

Therefore, we can restrict the output of dput() using a combination of these two:

dput(head(iris[, c(1, 3)]))

structure(list(Sepal.Length = c(5.1, 4.9, 4.7, 4.6, 5, 5.4), 
    Petal.Length = c(1.4, 1.4, 1.3, 1.5, 1.4, 1.7)), .Names = c("Sepal.Length", 
"Petal.Length"), row.names = c(NA, 6L), class = "data.frame")

will give us the code to reproduce the first (up to) six rows of columns 1 and 3 of the iris dataset. Anyone can then recreate the data by assigning that structure() output to a variable:

df <- structure(list(Sepal.Length = c(5.1, 4.9, 4.7, 4.6, 5, 5.4), 
    Petal.Length = c(1.4, 1.4, 1.3, 1.5, 1.4, 1.7)), .Names = c("Sepal.Length", 
"Petal.Length"), row.names = c(NA, 6L), class = "data.frame")

> df
  Sepal.Length Petal.Length
1          5.1          1.4
2          4.9          1.4
3          4.7          1.3
4          4.6          1.5
5          5.0          1.4
6          5.4          1.7

If the first rows do not suffice, we can skip using head() and rely on indexing only:

dput(iris[1:20, c(1, 3)])

structure(list(Sepal.Length = c(5.1, 4.9, 4.7, 4.6, 5, 5.4, 4.6, 
5, 4.4, 4.9, 5.4, 4.8, 4.8, 4.3, 5.8, 5.7, 5.4, 5.1, 5.7, 5.1
), Petal.Length = c(1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 
1.5, 1.5, 1.6, 1.4, 1.1, 1.2, 1.5, 1.3, 1.4, 1.7, 1.5)), .Names = c("Sepal.Length", 
"Petal.Length"), row.names = c(NA, 20L), class = "data.frame")

will give us the first twenty rows:

df <- structure(list(Sepal.Length = c(5.1, 4.9, 4.7, 4.6, 5, 5.4, 4.6, 
5, 4.4, 4.9, 5.4, 4.8, 4.8, 4.3, 5.8, 5.7, 5.4, 5.1, 5.7, 5.1
), Petal.Length = c(1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 
1.5, 1.5, 1.6, 1.4, 1.1, 1.2, 1.5, 1.3, 1.4, 1.7, 1.5)), .Names = c("Sepal.Length", 
"Petal.Length"), row.names = c(NA, 20L), class = "data.frame")

> df
   Sepal.Length Petal.Length
1           5.1          1.4
2           4.9          1.4
3           4.7          1.3
4           4.6          1.5
5           5.0          1.4
6           5.4          1.7
7           4.6          1.4
8           5.0          1.5
9           4.4          1.4
10          4.9          1.5
11          5.4          1.5
12          4.8          1.6
13          4.8          1.4
14          4.3          1.1
15          5.8          1.2
16          5.7          1.5
17          5.4          1.3
18          5.1          1.4
19          5.7          1.7
20          5.1          1.5
Tuesday, June 1, 2021
 
dkcwd
answered 7 Months ago
66

First, make sure that you have up-to-date versions of the needed modules (e.g. scipy, numpy, etc.). When you type random.seed(1234) here, you are using the numpy generator, because the star import in the code below makes numpy's random module shadow the standard library's random.
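A quick sketch of that shadowing (not part of the original code):

from numpy import *   # numpy's `random` module now shadows any stdlib `random` binding
print(random)         # <module 'numpy.random' ...>
random.seed(1234)     # seeds numpy's global RNG, same as np.random.seed(1234)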


When you use the random_state parameter of RandomForestClassifier, there are several options: an int, a RandomState instance, or None.


From the scikit-learn docs:

  • If int, random_state is the seed used by the random number generator;

  • If RandomState instance, random_state is the random number generator;

  • If None, the random number generator is the RandomState instance used by np.random.
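For illustration, here is roughly what the three options look like (a minimal sketch, not from the docs):

import numpy as np
from sklearn.ensemble import RandomForestClassifier

clf_int  = RandomForestClassifier(random_state=42)                          # int: fixed seed
clf_inst = RandomForestClassifier(random_state=np.random.RandomState(42))   # RandomState instance
clf_none = RandomForestClassifier(random_state=None)                        # None: falls back to np.random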


A way to use the same generator in both cases is the following: I seed the same (numpy) generator for both classifiers and get reproducible results (the same results in both cases).

from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from numpy import *  # `random` below is numpy.random, not the stdlib module

X, y = make_classification(n_samples=1000, n_features=4,
                           n_informative=2, n_redundant=0,
                           random_state=0, shuffle=False)

random.seed(1234)                          # seed numpy's global generator
clf = RandomForestClassifier(max_depth=2)  # random_state=None -> falls back to np.random
clf.fit(X, y)

# random.seed(1234) returns None, so random_state is None here as well,
# but the global numpy generator has just been re-seeded to the same state
clf2 = RandomForestClassifier(max_depth=2, random_state=random.seed(1234))
clf2.fit(X, y)

Check if the results are the same:

all(clf.predict(X) == clf2.predict(X))
#True

Check by running the same code 5 times:

from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from numpy import *  # again, `random` here is numpy.random

for i in range(5):
    X, y = make_classification(n_samples=1000, n_features=4,
                       n_informative=2, n_redundant=0,
                       random_state=0, shuffle=False)

    random.seed(1234)
    clf = RandomForestClassifier(max_depth=2)
    clf.fit(X, y)

    clf2 = RandomForestClassifier(max_depth=2, random_state = random.seed(1234))
    clf2.fit(X, y)

    print(all(clf.predict(X) == clf2.predict(X)))

Results:

True
True
True
True
True
Saturday, August 14, 2021
 
Shane Hsu
answered 4 Months ago
36

O'Reilly's Practical RDF has a chapter titled Commercial Uses of RDF/XML. The table at the left lists the subsections: Chandler, RDF Gateway, Seamark, and Adobe's XMP.

Sunday, September 5, 2021
 
subroutines
answered 3 Months ago
97

I've asked myself the same question before. I've found doctests to be of limited utility for things like views, model methods, and managers because:

  1. You need to be able to setup and teardown a test data set to actually use for testing
  2. Views need to take a request object. In a doctest, where does that come from?

For that reason, I've always used the Django unit testing framework, which handles all this for you. Unfortunately, though, you don't get some of the benefits of doctests, and it makes TDD/BDD harder to do. What follows next is pure speculation about how you might make this work:

I think you'd want to grab doctests from their respective modules and functions and execute them within the unit testing framework. This would take care of test data setup/teardown. If your doctests were executed from within a test method of something that subclasses Django's TestCase, they'd be able to use that test DB. You'd also be able to pass a mock request object into the doctest's execution context; Django's RequestFactory builds mock request objects for exactly this. Let's say you wanted to test the docstrings from all of an application's views. You could do something like this in tests.py:

from doctest import testmod, DocTestFailure

from django.test import RequestFactory, TestCase

from myapp import views

class MyAppTest(TestCase):

    fixtures = ['test_data.json']

    def test_doctests(self):
        try:
            # run the module's doctests inside the test DB/fixture context,
            # exposing a ready-made GET request to them as REQUEST
            testmod(views, extraglobs={
                'REQUEST': RequestFactory().get('/')
            }, raise_on_error=True)
        except DocTestFailure as e:
            self.fail(e)

This should allow you to do something like this:

def index(request):  
    """
    returns the top 10 most clicked products

    >>> response = index(REQUEST)
    >>> [test response content here]

    """     
    products = Product.objects.all()[:10]
    products = match_pictures_with_products(products, 10)
    return render_to_response('products/product_list.html', {'products': products})

Again, this is just off the top of my head and not at all tested, but it's the only way I can think of to do what you want without just putting all your view tests in the unit testing framework.

Friday, October 1, 2021
 
Manju
answered 2 Months ago
34

The problem isn't limited to Colab, and is reproducible locally. The behavior, however, may be inevitable.

The code at the bottom is a minimally reproducible version of your code, with fit parameters tweaked for faster testing. What I observed: the maximum difference in loss is only 0.0144% across 5 runs of 468 iterations each. This is pretty good. With batch_size=64, 60000 samples, and 20 epochs, you'll have 18750 iterations (60000/64 ≈ 938 batches per epoch × 20), which will amplify this figure substantially.

Regardless, GPU parallelism is the most likely culprit driving the randomness - and the small differences do accumulate over time to yield a substantial difference (demo below). If 1e-8 seems small, try adding random noise to half your weights with magnitude clipped at 1e-8, and witness its life philosophy change.

The role of the seeds becomes dramatically pronounced if you don't use them - try it; all your metrics will fly rampant within the first 10 iterations. Also, loss is better than accuracy for measuring runtime differences, as accuracy is far more sensitive to numeric precision errors: the difference between 60% and 70% accuracy on a 10-sample batch is a prediction that differs by 0.000001 relative to 0.5 - but loss will barely budge.
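A quick sketch of that sensitivity claim, with made-up numbers and binary cross-entropy for concreteness:

import numpy as np

y_true = np.ones(10)            # a 10-sample batch, all positives
p_a = np.full(10, 0.500001)     # all predictions barely above the 0.5 threshold
p_b = p_a.copy()
p_b[0] = 0.499999               # one prediction nudged below 0.5 by ~2e-6

accuracy = lambda p: np.mean((p > 0.5) == y_true)
bce_loss = lambda p: -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

print(accuracy(p_a), accuracy(p_b))   # 1.0 vs 0.9 -> a 10% accuracy swing
print(bce_loss(p_a), bce_loss(p_b))   # both ~0.693147 -> loss barely budges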

Lastly, note that your hyperparameter choices will have a far greater impact upon model performance than randomness; no matter how many seeds you throw, they won't magic a model into SOTA.


Your code is fine. You've taken all practical steps to ensure reproducibility, with one exception: PYTHONHASHSEED must be set in the environment before your Python kernel starts.
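For instance (a sketch; the script name is hypothetical):

# set it in the shell *before* the interpreter starts, e.g.:
#   PYTHONHASHSEED=0 python train.py
# setting it from inside a running script or notebook is too late:
import os
os.environ['PYTHONHASHSEED'] = '0'   # has no effect on str hashing in this process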


What can you do to reduce randomness?

  1. Repeat runs, average results. Understandably that's expensive, but note that even a perfectly reproducible run isn't perfectly informative, as model variance w.r.t. train & validation sets is likely to be much greater than noise-induced randomness

  2. K-Fold Cross-Validation: can mitigate both data & noise variance significantly (minimal sketch after this list)

  3. Larger validation set: extracted features can differ only so much due to noise; the larger the validation set, the less small perturbations in weights should reflect in metrics
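A minimal sketch of point 2 with scikit-learn, using a small stand-in classifier and synthetic data rather than your Keras model:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold

X = np.random.randn(1000, 4)          # synthetic stand-in data
y = np.random.randint(0, 2, 1000)

scores = []
for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    clf = RandomForestClassifier(max_depth=2, random_state=0)
    clf.fit(X[train_idx], y[train_idx])
    scores.append(clf.score(X[val_idx], y[val_idx]))

# averaging across folds smooths out both data variance and noise variance
print(np.mean(scores), np.std(scores))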


GPU Parallelism: amplifying float error

print(2. * 11. / 9.)  # 2.4444444444444446
print(2. / 9. * 11.)  # 2.444444444444444

Order of operations matters, and by exploiting multithreading, GPU parallelism gives no guarantee whatsoever of operations being executed in the same order. At first glance, the difference may look innocent - but give it enough iterations ...

one = 1
for _ in range(int(1e8)):
    one *= (2. / 9. * 11.) / (2. * 11. / 9.)
print(one)     # 0.9999999777955395
print(1 - one) # 1.8167285897874308e-08

... and a "one" is now a typical small-weight value of 1e-08 away from being its original self. If 100 million iterations seems like a stretch, consider that this loop completed in about half a minute, running entirely on CPU, whereas your model can train for over an hour.


Minimal reproducible experimentation:

import tensorflow as tf
import random as rn
import numpy as np
np.random.seed(1)      # seed numpy's RNG
rn.seed(2)             # seed Python's built-in RNG
tf.set_random_seed(3)  # seed TensorFlow's graph-level RNG (TF1.x API)

from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation, Flatten, BatchNormalization
from keras.layers import MaxPooling2D, Conv2D
from keras.optimizers import Adam

def model_cnn():
  model = Sequential()
  model.add(Conv2D(32, kernel_size=(3,3), 
                   kernel_initializer='he_uniform', input_shape=(28,28,1)))
  model.add(BatchNormalization())
  model.add(Activation('relu'))
  model.add(Conv2D(32, kernel_size=(3,3), kernel_initializer='he_uniform'))
  model.add(BatchNormalization())
  model.add(Activation('relu'))
  model.add(MaxPooling2D(pool_size=(2,2)))
  model.add(Dropout(0.25))
  model.add(Flatten())
  model.add(Dense(512, kernel_initializer='he_uniform'))
  model.add(BatchNormalization())
  model.add(Activation('relu'))
  model.add(Dropout(0.5))
  model.add(Dense(10, kernel_initializer='he_uniform'))
  model.add(Activation('softmax'))
  model.compile(loss="categorical_crossentropy", optimizer=Adam(lr=0.001), 
                metrics=['accuracy'])
  return model

# re-seed before generating data and building the model
np.random.seed(1)
rn.seed(2)
tf.set_random_seed(3)

X_train = np.random.randn(30000, 28, 28, 1)
y_train = np.random.randint(0, 2, (30000, 10))
X_val   = np.random.randn(30000, 28, 28, 1)
y_val   = np.random.randint(0, 2, (30000, 10))
model = model_cnn()

# re-seed once more so training starts from the same RNG state
np.random.seed(1)
rn.seed(2)
tf.set_random_seed(3)

history = model.fit(X_train, y_train, batch_size=64,shuffle=True, 
                    epochs=1, verbose=1, validation_data=(X_val,y_val))

Run differences:

loss: 12.5044 - acc: 0.0971 - val_loss: 11.5389 - val_acc: 0.1051
loss: 12.5047 - acc: 0.0958 - val_loss: 11.5369 - val_acc: 0.1018
loss: 12.5055 - acc: 0.0955 - val_loss: 11.5382 - val_acc: 0.0980
loss: 12.5042 - acc: 0.0961 - val_loss: 11.5382 - val_acc: 0.1179
loss: 12.5062 - acc: 0.0960 - val_loss: 11.5366 - val_acc: 0.1082
Tuesday, October 12, 2021
 
Nishad
answered 2 Months ago