Data Sampling

hyerimir 2024. 1. 21. 21:07

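# Stratified random sampling
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, shuffle = True, stratify = y, random_state = 48)
# shuffle = True shuffles the rows before splitting (random sampling)
# shuffle = False splits in the original row order without shuffling; stratify must then be None, otherwise scikit-learn raises a ValueError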
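
What stratify = y buys you is that each class keeps roughly the same share in the train and test parts. The block below is a minimal standard-library sketch of that idea, not scikit-learn's implementation; the helper stratified_split and the toy target y_demo are hypothetical names made up for illustration.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.2, seed=48):
    """Return (train_idx, test_idx) with roughly the same class ratio in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                      # random sampling within each class
        n_test = round(len(indices) * test_size)  # per-class test quota
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

# Hypothetical imbalanced target: 80 zeros and 20 ones
y_demo = [0] * 80 + [1] * 20
train_idx, test_idx = stratified_split(y_demo, test_size=0.2)
train_counts = Counter(y_demo[i] for i in train_idx)
test_counts = Counter(y_demo[i] for i in test_idx)
print(train_counts, test_counts)
```

Both parts keep the 4:1 ratio of the full target, which a plain random split does not guarantee when the minority class is small.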

import pandas as pd
df_t = pd.read_csv('../../data/titanic.csv')
df_t['survived'].value_counts()  # check the class balance of the target

# Oversampling
# SMOTE: Synthetic Minority Over-sampling Technique
# Install imbalanced-learn first, then import SMOTE
! conda install -c conda-forge imbalanced-learn -y
# The trailing -y answers "yes" to every confirmation prompt
from imblearn.over_sampling import SMOTE

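df_t = df_t.dropna()
# SMOTE cannot handle null values, so rows containing them are dropped first
# Note: dropna() drops a row if any column is null; pass subset=[...] to check only the columns you use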

X = df_t[['age', 'sibsp', 'parch', 'fare']]
y = df_t['survived']

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, shuffle = True, random_state = 48)

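# Scale the features before applying SMOTE (SMOTE relies on nearest-neighbor distances)
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler(feature_range=(0,1))
X_train = scaler.fit_transform(X_train)
# For the test set, reuse the fitted scaler: X_test = scaler.transform(X_test)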
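
To make the fit-on-train, transform-on-test pattern concrete, here is a minimal standard-library sketch of min-max scaling. It is not scikit-learn's MinMaxScaler; the helpers fit_min_max and apply_min_max and the (age, fare) rows are hypothetical and only illustrate the arithmetic (x - min) / (max - min).

```python
def fit_min_max(rows):
    """Learn the per-column minimum and maximum from the training rows."""
    columns = list(zip(*rows))
    return [min(c) for c in columns], [max(c) for c in columns]

def apply_min_max(rows, mins, maxs):
    """Scale rows with previously learned statistics: (x - min) / (max - min)."""
    return [
        [(x - lo) / (hi - lo) if hi > lo else 0.0 for x, lo, hi in zip(row, mins, maxs)]
        for row in rows
    ]

# Hypothetical (age, fare) rows
train_rows = [[20.0, 10.0], [40.0, 30.0], [60.0, 50.0]]
test_rows = [[30.0, 70.0]]

mins, maxs = fit_min_max(train_rows)                   # statistics come from train only
train_scaled = apply_min_max(train_rows, mins, maxs)
test_scaled = apply_min_max(test_rows, mins, maxs)     # test reuses the train statistics
print(train_scaled)
print(test_scaled)
```

Because the test rows are scaled with the training minimum and maximum, a test value above the training range (the fare of 70.0 here) lands outside [0, 1]. That is expected and keeps test information out of the fitted statistics.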

# Configure SMOTE and resample the training set only
sm = SMOTE(k_neighbors = 5)
X_resampled, y_resampled = sm.fit_resample(X_train, y_train)
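
The core idea behind SMOTE is to create each synthetic minority sample on the line segment between a minority point and one of its k nearest minority neighbors. The following is a simplified standard-library sketch of that interpolation, assuming two classes and at least two minority points; it is not imbalanced-learn's implementation, and smote_sketch, X_demo and y_demo are hypothetical names.

```python
import math
import random
from collections import Counter

def smote_sketch(X, y, k_neighbors=5, seed=48):
    """Oversample the minority class by interpolating between minority neighbors."""
    rng = random.Random(seed)
    counts = Counter(y)
    minority = min(counts, key=counts.get)
    n_new = max(counts.values()) - counts[minority]   # samples needed to balance
    points = [x for x, label in zip(X, y) if label == minority]

    synthetic = []
    for _ in range(n_new):
        base = rng.choice(points)
        # k nearest minority neighbors of the base point (excluding itself)
        others = [p for p in points if p is not base]
        neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k_neighbors]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment from base to neighbor
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbor)])

    return X + synthetic, y + [minority] * n_new

# Hypothetical scaled features: 3 minority rows (label 1), 5 majority rows (label 0)
X_demo = [[0.1, 0.2], [0.2, 0.1], [0.3, 0.3],
          [0.8, 0.9], [0.9, 0.8], [0.7, 0.7], [0.6, 0.9], [0.9, 0.6]]
y_demo = [1, 1, 1, 0, 0, 0, 0, 0]
X_res, y_res = smote_sketch(X_demo, y_demo, k_neighbors=2)
print(Counter(y_res))
```

Since the neighbors are found by distance, features on very different scales would let one column dominate the neighbor search, which is why scaling comes before this step.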
