Machine Learning - splitting data into training and test sets
How can I split a given dataset into training and test sets while keeping each data point paired with its correct label?
scikit-learn offers something close to what I need:
from sklearn.cross_validation import train_test_split
train, test = train_test_split(df, test_size = 0.2)
where df is the original dataset, for example a list of strings.
The problem is that it doesn't accept the targets / labels along with the dataset, so we cannot track which label belongs to which data point.
Is there a way to bind the data points to their labels and then split the dataset into training and test sets?
sklearn.cross_validation.train_test_split accepts a variable number of arrays to split. From its documentation:
*arrays : sequence of arrays or scipy.sparse matrices with the same shape[0]
Returns:
splitting : list of arrays, length = 2 * len(arrays). List containing the train-test split of each input array.
So you can simply pass the list of labels as an additional array:
>>> from sklearn import cross_validation
>>> df = ['the', 'quick', 'brown', 'fox']
>>> labels = [0, 1, 0, 0]
>>> cross_validation.train_test_split(df, labels, test_size=0.2)
[['quick', 'fox', 'the'], ['brown'], [1, 0, 0], [0]]
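The returned list is easier to work with when unpacked into named variables. Below is a minimal sketch of the same call, assuming a newer scikit-learn release: sklearn.cross_validation was deprecated in 0.18 and removed in 0.20, and train_test_split is now imported from sklearn.model_selection. The variable names and the fixed random_state are illustrative choices, not part of the original answer.

```python
# Assumes scikit-learn >= 0.18, where train_test_split lives in
# sklearn.model_selection instead of sklearn.cross_validation.
from sklearn.model_selection import train_test_split

data = ['the', 'quick', 'brown', 'fox']
labels = [0, 1, 0, 0]

# Outputs come back as train/test pairs, in the order the inputs were passed.
# random_state is fixed here only to make the shuffle reproducible.
data_train, data_test, labels_train, labels_test = train_test_split(
    data, labels, test_size=0.2, random_state=0
)

# Every data point still lines up with the label it started with.
lookup = dict(zip(data, labels))
pairing_ok = (
    all(lookup[x] == y for x, y in zip(data_train, labels_train))
    and all(lookup[x] == y for x, y in zip(data_test, labels_test))
)
print(pairing_ok)
```

Because both arrays are shuffled with the same permutation, index i of data_train always corresponds to index i of labels_train, and likewise for the test halves.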
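If you would rather bind the labels to the data points yourself, as the question asks, the idea can be sketched in plain Python by zipping the two lists and shuffling the pairs. The helper name split_with_labels, its seed parameter, and the rounding rule for the test-set size are all assumptions made for this illustration; scikit-learn computes the split sizes with its own rules.

```python
import random


def split_with_labels(data, labels, test_size=0.2, seed=None):
    """Hypothetical helper: shuffle (item, label) pairs together, then split."""
    if len(data) != len(labels):
        raise ValueError("data and labels must have the same length")

    # Binding each item to its label means one shuffle moves both together.
    pairs = list(zip(data, labels))
    random.Random(seed).shuffle(pairs)

    # Assumption: round to the nearest whole sample, with at least one in the test set.
    n_test = max(1, round(len(pairs) * test_size))
    test_pairs = pairs[:n_test]
    train_pairs = pairs[n_test:]
    return train_pairs, test_pairs


data = ['the', 'quick', 'brown', 'fox']
labels = [0, 1, 0, 0]
train_pairs, test_pairs = split_with_labels(data, labels, test_size=0.2, seed=0)
print(train_pairs, test_pairs)
```

Each element of train_pairs and test_pairs is an (item, label) tuple, so the label can never drift away from its data point; use zip(*train_pairs) to get separate sequences back when needed.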