Posts in the 'Big Data Analysis Engineer' (빅데이터분석기사) category

https://www.datamanim.com/dataset/practice/q3.html

Round 3 (DataManim)


Import

import os
import pandas as pd
import numpy as np
import datetime 
from dateutil.relativedelta import relativedelta
!pip install tqdm
import tqdm
import zipfile
import re
from tqdm import tqdm

Task Type 1

import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/Datamanim/datarepo/main/spotify/spotify.csv')
df.head()

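Data description: Spotify Top 100 songs, 2010-2019

Question1

The data is currently sorted by popularity, 100 songs per year.

Create a rank column holding each year's ranking from 1 to 100, then compute the mean of the bpm column over the number-one songs of every year.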

lst = list(range(1,11))
lst

lst = list(range(1,11))*2
lst

df = df.dropna()
df.loc[:,'rank'] = list(range(1,101))*10
result = df[df['rank'] ==1].bpm.mean()
print(result)

Question2

Which artist placed the most songs in the Top 100 in 2015?

result = df[df['top year'] ==2015].artist.value_counts().index[0]
print(result)

Question3

Among the songs ranked 1 to 10 in each year, which top genre is the second most common?

result = df[df['rank'].isin(range(1,11))]['top genre'].value_counts().index[1]
print(result)

Question4

Featured artists appear in the title. Which artist was featured most often?

df.title.str.split('feat.').str[1]

df.title.str.split('feat.').str[1].dropna()

df.title.str.split('feat.').str[1].dropna().str[:-1]

result = df.title.str.split('feat.').str[1].dropna().str[:-1].str.strip().value_counts().index[0]
print(result)

Question5

Grouping by top year, count the songs whose release year (year released) differs from the year they entered the Top 100 (top year). Which year has the most such songs?

year = int(df[df['year released'] != df['top year']]['top year'].value_counts().index[0])
print(year)

Question6

How many artists have the letter q in the artist column, ignoring case?

result = df[df.artist.str.lower().str.contains('q')].artist.nunique()
print(result)

Question7

Across the whole dataset, regardless of year, what is the difference between the mean dur of ranks 1-50 and the mean dur of ranks 51-100?

result = df[df['rank'].isin(range(1,51))].dur.mean() - df[df['rank'].isin(range(51,101))].dur.mean()
print(result)

Question8

When the titles are split into words on whitespace, which word appears most often? (case-insensitive)

df.title.str.split(r'\(feat').str[0]

df.title.str.split(r'\(feat').str[0].str.split().explode()

result = df.title.str.split(r'\(feat').str[0].str.split().explode().str.lower().value_counts().index[0]
print(result)

explode: expanding list-like values

Each element of a list becomes its own row, and the index value is replicated for every new row
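As a small illustration of the explode note above, the sketch below uses two made-up titles (hypothetical values, not rows from the Spotify data) to show how the index label is repeated for every word.

```python
import pandas as pd

# Hypothetical titles, not taken from the Spotify data
titles = pd.Series(["Hello World", "Hello Again"], index=["a", "b"])

words = titles.str.split()      # each row holds a list of words
exploded = words.explode()      # one row per word; index labels repeat

print(exploded)
print(exploded.str.lower().value_counts().index[0])  # most frequent word
```

Because the index is replicated, each word can still be traced back to the title it came from.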

Question9

Compute the mean nrgy per year, then find the difference between the largest and the smallest yearly mean.

m = df.groupby(['top year']).nrgy.mean().sort_values().values
m

m = df.groupby(['top year']).nrgy.mean().sort_values().values
result = m[-1] - m[0]
print(result)

Question10

Which artist has more than one artist type?

df[['artist','artist type']].value_counts().reset_index()

df[['artist','artist type']].value_counts().reset_index().artist.value_counts()

result = df[['artist','artist type']].value_counts().reset_index().artist.value_counts().index[0]
print(result)

Task Type 2

import pandas as pd
# Load data
x_train = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/muscle/x_train.csv")
y_train = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/muscle/y_train.csv")
test= pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/muscle/x_test.csv")


display(x_train.head())
display(y_train.head())

Data description: classify the motion type from sensor data (dependent variable pose: 0 or 1)

from sklearn.preprocessing import StandardScaler

x = x_train.drop(columns = ['ID'])
test_drop = test.drop(columns = ['ID'])

sc = StandardScaler()
sc.fit(x)

xs = sc.transform(x)
x_test_scaler = sc.transform(test_drop)

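Fit the scaler on the train data only, then use it to transform both the train and the test data.

In this walkthrough the scaling is done before train_test_split. Strictly speaking, that lets the validation rows influence the scaler statistics, so fitting on the training split alone is the more careful option.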
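A minimal sketch of the "fit on train, transform everything else" rule, applied after the split so that the validation rows never touch the scaler statistics. The data is synthetic and the variable names are assumptions, not the muscle-sensor dataset used in this post.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the sensor features (assumption, not the real data)
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
y = rng.integers(0, 2, size=200)

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.33, random_state=42)

sc = StandardScaler()
X_tr_s = sc.fit_transform(X_tr)   # statistics come from the training split only
X_val_s = sc.transform(X_val)     # validation rows reuse those statistics

print(X_tr_s.mean(axis=0).round(6))  # approximately 0 for every column
```

The validation means will not be exactly zero, which is expected: they are scaled with the training statistics, just as the real test set would be.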

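from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# train_test_split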
X_train, X_test, y_train, y_test = train_test_split(xs, y_train['pose'], test_size = 0.33, random_state = 42)

lr = LogisticRegression()
lr.fit(X_train, y_train)

pred = lr.predict_proba(X_test)
print('validation_auc : ', roc_auc_score(y_test, pred[:,1]))

# # Change the predicted column and the exam ID in the code below to match your own submission
# # pd.DataFrame({'id': test.id, 'stroke': pred}).to_csv('003000000.csv', index=False)
#pd.DataFrame({'id': test.ID, 'pose': lr.predict_proba(x_test_scaler)[:,1]}).to_csv('003000000.csv', index=False)


https://www.datamanim.com/dataset/practice/q2.html

Round 2 (DataManim)


Import

import os
import pandas as pd
import numpy as np
import datetime 
from dateutil.relativedelta import relativedelta
!pip install tqdm
import tqdm
import zipfile
import re
from tqdm import tqdm

Task Type 1

import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/Datamanim/datarepo/main/stroke_/train.csv')
df.head()

Data description: predicting whether a stroke occurs

Question1

What is the mean age of patients whose gender is Male?

df['age']

df['age'] = df['age'].str.replace('*','', regex=False).astype('int')
result = df[df.gender =='Male'].age.mean()
print(result)

Question2

Fill the missing values of the bmi column with the median of its non-missing values, then compute the mean of the bmi column to three decimal places.

fi = df['bmi'].fillna(df['bmi'].median())
result = round(fi.mean(),3)
print(result)

Question3

Fill each missing bmi value with the bmi of the preceding row, then compute the mean of the bmi column to three decimal places.

fi = df['bmi'].ffill()
result = round(fi.mean(),3)
print(result)

Question4

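Replace each missing bmi value with the mean bmi of that patient's age band (in steps of 10), then compute the mean of the resulting bmi column to three decimal places.

# Compute the mean per age band, excluding missing values, and convert it to a dictionary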
mean = df[df.bmi.notnull()].groupby(df.age//10 *10).bmi.mean()
dic = { x:y for x,y in mean.items()}
mean

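items()

Returns (key, value) pairs. Here it is called on the Series of band means, where it yields (index, value) pairs just as dict.items() does for a dictionary.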

dic

idx =df.loc[df.bmi.isnull(),['age','bmi']].index
idx

# Map each missing value to the mean of its age band
df.loc[df.bmi.isnull(),'bmi']  =(df[df.bmi.isnull()].age//10*10).map(lambda x : dic[x])

result = round(df.bmi.mean(),3)
print(result)
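The same age-band imputation can also be written without an intermediate dictionary. The sketch below is an alternative, not the method used in this post; it relies on groupby(...).transform('mean') and a tiny hypothetical frame rather than the stroke data.

```python
import numpy as np
import pandas as pd

# Small hypothetical frame, not the stroke dataset
toy = pd.DataFrame({
    "age": [21, 25, 28, 33, 37],
    "bmi": [20.0, np.nan, 24.0, 30.0, np.nan],
})

band = toy["age"] // 10 * 10                            # 20, 20, 20, 30, 30
band_mean = toy.groupby(band)["bmi"].transform("mean")  # NaN is skipped by mean
toy["bmi"] = toy["bmi"].fillna(band_mean)

print(toy)
```

transform returns one value per original row, so it lines up with the bmi column and can be passed straight to fillna.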

Question5

Change every avg_glucose_level value of 200 or more to 199, then compute the mean avg_glucose_level of the rows where stroke is 1, to three decimal places.

df.loc[df.avg_glucose_level >=200,'avg_glucose_level'] =199
result = round(df[df.stroke ==1].avg_glucose_level.mean(),3)
print(result)

Task Type 1 (different dataset)

import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/Datamanim/datarepo/main/pok/Pokemon.csv')
df.head()

Data description: Pokémon information

Question6

Sort by the Attack column in descending order. What is the difference in the number of legendary Pokémon (Legendary column) between ranks 1-400 and ranks 401-800?

up = df.sort_values('Attack',ascending=False).reset_index(drop=True)[:400]
down = df.sort_values('Attack',ascending=False).reset_index(drop=True)[400:]

result = up.Legendary.sum() - down.Legendary.sum()
print(result)

Question7

Sort the mean of the Total column per Type 1 in descending order. Which Type 1 is third from the top?

df.groupby(['Type 1']).Total.mean().sort_values(ascending=False)

result = df.groupby(['Type 1']).Total.mean().sort_values(ascending=False).index[2]
print(result)

Question8

After dropping every row that contains a missing value, take the first 60% of the rows in order and compute the first quartile of the Defense column.

result = df.dropna()[:int(len(df.dropna()) *0.6)].Defense.quantile(.25)
print(result)

Question9

How many Water-type Pokémon have an Attack at or above the mean Attack of the Pokémon whose Type 1 is Fire?

target = df[df.Attack >= df[df['Type 1'] =='Fire'].Attack.mean()]
result = target[target['Type 1']=='Water'].shape[0]
print(result)

Question10

Among the generations (Generation column), which one has the largest absolute difference between its mean Speed and its mean Defense?

df.groupby(['Generation'])[['Speed','Defense']].mean()

df.groupby(['Generation'])[['Speed','Defense']].mean().T

df.groupby(['Generation'])[['Speed','Defense']].mean().T.diff().T

result = abs(df.groupby(['Generation'])[['Speed','Defense']].mean().T.diff().T).sort_values('Defense').index[-1]
print(result)

Task Type 2

import pandas as pd
train= pd.read_csv('https://raw.githubusercontent.com/Datamanim/datarepo/main/stroke_/train.csv')
test= pd.read_csv('https://raw.githubusercontent.com/Datamanim/datarepo/main/stroke_/test.csv')

display(train.head())
display(test.head())

Data description: predicting whether a stroke occurs

A simple modeling pass

#import
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, classification_report
from sklearn.ensemble import RandomForestClassifier

# Preprocessing
train['age'] = train['age'].str.replace('*', '', regex=False).astype('int')
train['bmi'] = train['bmi'].fillna(train['bmi'].mean())
test['bmi'] = test['bmi'].fillna(test['bmi'].mean())

x = train.drop(columns = ['id', 'stroke'])
xd = pd.get_dummies(x)
y = train['stroke']

# Training
x_train, x_test, y_train, y_test = train_test_split(xd, y, stratify = y, random_state = 1)
rf = RandomForestClassifier()
rf.fit(x_train, y_train)
pred = rf.predict_proba(x_test)
print('test roc score  : ', roc_auc_score(y_test, pred[:,1]))

# After one-hot encoding, some columns exist only in the train set
test_preprocessing = pd.get_dummies(test.drop(columns=['id']))
test_preprocessing

list(set(x_train.columns) -set(test_preprocessing))

test_preprocessing[list(set(x_train.columns) -set(test_preprocessing))] =0
test_preprocessing

Any column added to train must be added to test in the same way.

test_preprocessing =test_preprocessing[x_train.columns]
test_pred = rf.predict_proba(test_preprocessing)
test_pred
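The column alignment above (add the missing dummy columns as 0, then reorder to match train) can be done in one step with DataFrame.reindex. This is a sketch on hypothetical frames, offered as an alternative to the set-difference approach rather than a replacement for it.

```python
import pandas as pd

# Hypothetical frames, not the stroke data
train_x = pd.DataFrame({"age": [30, 40, 50], "work": ["a", "b", "c"]})
test_x = pd.DataFrame({"age": [35, 45], "work": ["a", "b"]})

train_d = pd.get_dummies(train_x)
# reindex adds the missing 'work_c' column filled with 0 and fixes the order
test_d = pd.get_dummies(test_x).reindex(columns=train_d.columns, fill_value=0)

print(list(test_d.columns))
```

reindex also drops any dummy column that exists only in test, which a fitted model could not use anyway.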

# Change the predicted column and the exam ID in the code below to match your own submission
# pd.DataFrame({'id': test.id, 'stroke': pred}).to_csv('003000000.csv', index=False)
#pd.DataFrame({'id': test.id, 'stroke': test_pred[:,1]}).to_csv('003000000.csv', index=False)


https://www.datamanim.com/dataset/practice/q1.html

Round 1 (DataManim)


Import

import os
import pandas as pd
import numpy as np
import datetime 
from dateutil.relativedelta import relativedelta
!pip install tqdm
import tqdm
import zipfile
import re
from tqdm import tqdm

Task Type 1

import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/Datamanim/datarepo/main/bank/train.csv')
df.head()

Data description: customer response to a bank's telephone marketing campaign

df.shape

Question1

When the ages of the customers in the marketing-response data are converted to 10-year bands, which band has the most people? (0-9: 0, 10-19: 10)

df.age//10 *10

// is floor division (the integer quotient)
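A quick check of the floor-division trick on a handful of made-up ages (hypothetical values, not the bank data):

```python
import pandas as pd

ages = pd.Series([7, 19, 20, 34, 39])   # hypothetical ages
bands = ages // 10 * 10                 # floor-divide, then scale back up
print(bands.tolist())                   # [0, 10, 20, 30, 30]
```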

(df.age//10 *10).value_counts()

result = (df.age//10 *10).value_counts().index[0]
print(result)

Question2

When the ages of the customers in the marketing-response data are converted to 10-year bands, how many people are in the largest band?

result = (df.age//10 *10).value_counts().values[0]
print(result)

The difference between the two questions above

Question1 reads value_counts().index (the label), while Question2 reads value_counts().values (the count)
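To make the index/values distinction concrete, here is a tiny sketch with hypothetical age bands (not the bank data):

```python
import pandas as pd

bands = pd.Series([30, 30, 30, 40, 40, 20])  # hypothetical age bands
counts = bands.value_counts()                # sorted by count, descending

print(counts.index[0])    # 30 -> the most common band (the label)
print(counts.values[0])   # 3  -> how many rows fall in it (the count)
```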

Question3

Among the customers aged 25 or older and younger than 29, how many have a housing value of yes?

result = df[(df.age >=25) & (df.age<29) & (df.housing =='yes')].shape[0]
print(result)

Question4

Among the columns that do not hold numeric values, which one has the most unique values?

df.select_dtypes(exclude='number')

for col in df.select_dtypes(exclude='number'):
    print(col)

lst = []
for col in df.select_dtypes(exclude='number'):
    target = df[col]
    lst.append([col,target.nunique()])

pd.DataFrame(lst)

result = pd.DataFrame(lst).sort_values(1,ascending=False).values[0][0]
print(result)

Question5

Take the rows whose balance is at or above the mean balance, sort them by ID in descending order, and compute the mean balance of the top 100 rows.

result = df[df.balance >= df.balance.mean()].sort_values('ID',ascending=False).head(100).balance.mean()
print(result)

Question6

On which date were the most ads run? (Give it as it appears in the data: day as a number, month in English.)

df[['day','month']].value_counts()

result = df[['day','month']].value_counts().index[0]
print(result)

Question7

We want to test the normality of the age column for customers whose job is unknown.

Compute the p-value of the Shapiro-Wilk test.

from scipy.stats import shapiro
shapiro(df[df.job =='unknown'].age)

from scipy.stats import shapiro
result = shapiro(df[df.job =='unknown'].age)[1]
print(result)

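Shapiro-Wilk test

A statistical test of normality. The null hypothesis is that the sample comes from a normal distribution, so a small p-value is evidence against normality.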
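To see how the Shapiro-Wilk p-value behaves, the sketch below compares a synthetic normal sample with a clearly skewed one. The samples and the 0.05 cutoff are assumptions for illustration, not part of the exam question.

```python
import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(0)
normal_sample = rng.normal(size=200)        # drawn from a normal distribution
skewed_sample = rng.exponential(size=200)   # clearly non-normal

stat_n, p_n = shapiro(normal_sample)
stat_s, p_s = shapiro(skewed_sample)

alpha = 0.05  # conventional significance level (an assumption)
print("normal sample p-value:", p_n)
print("skewed sample p-value:", p_s, "-> reject normality:", p_s < alpha)
```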

Question8

Compute the correlation coefficient between age and balance.

result = df[['age','balance']].corr().iloc[0,1]
print(result)

Question9

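We want to check with a chi-square test whether the y variable and the education variable are independent.

Print the p-value.

Chi-square test (cross-tabulation analysis)

Start by building a crosstab

v = pd.crosstab(df.y,df.education)
v

from scipy.stats import chi2_contingency
chi2_contingency(v)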

Interpreting the result

(1) chi-square statistic (shown to three decimal places)

(2) p-value

(3) degrees of freedom (df)

(4) expected frequencies (expected value)

from scipy.stats import chi2_contingency
chi2 , p ,dof, expected = chi2_contingency(v)
print(p)
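The four values returned by chi2_contingency are easier to read on a small table. The sketch below uses a hypothetical 2x2 table of counts (not the bank data) and labels each returned value.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical 2x2 table of counts, not the bank data
table = pd.DataFrame([[30, 10], [20, 40]],
                     index=["yes", "no"], columns=["primary", "tertiary"])

chi2, p, dof, expected = chi2_contingency(table)

print("chi2 statistic:", round(chi2, 3))
print("p-value:", p)
print("degrees of freedom:", dof)        # (rows - 1) * (cols - 1) = 1
print("expected counts:\n", expected)    # counts implied by independence
```

The expected counts are what each cell would hold if the two variables were independent; the further the observed table is from them, the larger the statistic and the smaller the p-value.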

Question10

For each job, compute the ratio of divorced to married customers. What is the highest ratio?

t = df.groupby(['job','marital']).size().reset_index()
t

t.pivot_table(index='job',columns='marital')

t = df.groupby(['job','marital']).size().reset_index()
pivotdf = t.pivot_table(index='job',columns='marital')[0]
pivotdf = pivotdf.fillna(0)
pivotdf['ratio'] = pivotdf['divorced'] / pivotdf['married']

result = pivotdf.sort_values('ratio').ratio.values[-1]
print(result)

Task Type 2

import pandas as pd
train= pd.read_csv('https://raw.githubusercontent.com/Datamanim/datarepo/main/bank/train.csv')
test= pd.read_csv('https://raw.githubusercontent.com/Datamanim/datarepo/main/bank/test.csv')
submission= pd.read_csv('https://raw.githubusercontent.com/Datamanim/datarepo/main/bank/submission.csv')

display(train.head())
display(test.head())
display(submission.head())

train/test/submission

Customer response to a bank's telephone marketing campaign

A simple modeling pass

From modeling through creating the submission file

train_test_split

from sklearn.model_selection import train_test_split
x = train.drop(columns=['ID','y'])
xd = pd.get_dummies(x)
y = train['y']

x_train, x_test, y_train, y_test = train_test_split(xd, y, stratify = y, random_state = 1)

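pd.get_dummies()
When the data contains object-typed columns, think of the encoding in two steps:
(1) map each category to a numeric code (0, 1, 2, 3, ...)
(2) expand those codes into dummy (0/1) variables
pd.get_dummies() performs both in a single call, which puts the data in a form suitable for machine learning.

If you stop after the numeric codes, the model sees an artificial ordering among the categories.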
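The difference between numeric codes and dummy variables can be seen on a short hypothetical column (the job values below are made up for illustration):

```python
import pandas as pd

jobs = pd.Series(["admin", "student", "retired", "admin"], name="job")  # hypothetical

codes = jobs.astype("category").cat.codes     # admin=0, retired=1, student=2
dummies = pd.get_dummies(jobs, prefix="job")  # one 0/1 column per category

print(codes.tolist())
print(dummies.astype(int))
```

With the codes alone, student (2) looks "larger" than admin (0); in the dummy representation every category gets its own column, so no such ordering is implied.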

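train_test_split
It is important to understand exactly why the data is split.
Strictly speaking, this is a train/validation split rather than train/test.
A model trained on 100% of the train data often performs worse than expected on the test data, which is what overfitting looks like; holding out a validation set lets you detect this beforehand.

Parameters:
test_size : proportion of the data placed in the test set (default = 0.25)
shuffle : default is True
stratify : default is None. Very important for classification: passing the target as stratify keeps the class ratio the same in train and validation, so the classes are not split lopsidedly.
random_state : an int seed that makes the split reproducible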
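The effect of stratify is easiest to see on an imbalanced target. The sketch below builds a hypothetical 90/10 target and checks the positive rate in each split; the data is made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced target: 90 zeros and 10 ones
y = np.array([0] * 90 + [1] * 10)
X = np.arange(100).reshape(-1, 1)

X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1
)

# With stratify=y both splits keep the 10% positive rate
print(y_tr.mean(), y_val.mean())
```

Without stratify, a small validation set could by chance end up with very few positives, or none, which makes metrics such as ROC AUC unstable or undefined.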

RandomForestClassifier

from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier()
rf.fit(x_train, y_train)
pred = rf.predict_proba(x_test)
pred

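Metrics used to evaluate machine learning performance

Accuracy : how often the predicted labels match the actual labels
Confusion matrix (error matrix) : shows how many classification errors occur and which types of error they are
- Metrics derived from the confusion matrix : accuracy, precision, recall (sensitivity), specificity
Precision/recall trade-off : precision and recall pull against each other, so raising one tends to lower the other
scikit-learn classifiers that expose predict_proba() compute a probability for each label and predict the label with the highest probability
For binary classification this amounts to a 50% threshold: above it is positive, below it is negative
predict_proba() returns the predicted probability for each label
F1 score : combines precision and recall (their harmonic mean) and is relatively high when neither is much lower than the other
ROC curve and AUC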
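The metrics listed above can all be computed with sklearn.metrics. The sketch below uses six hypothetical labels and probabilities (not model output from this post) so each number can be checked by hand.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)

# Hypothetical labels and positive-class probabilities
y_true = np.array([0, 0, 0, 1, 1, 1])
proba = np.array([0.1, 0.4, 0.6, 0.3, 0.7, 0.9])

y_pred = (proba >= 0.5).astype(int)   # apply the usual 0.5 threshold

print(confusion_matrix(y_true, y_pred))   # rows: actual, columns: predicted
print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("f1       :", f1_score(y_true, y_pred))
print("roc auc  :", roc_auc_score(y_true, proba))  # uses probabilities, not labels
```

Here the threshold produces 2 true negatives, 1 false positive, 1 false negative and 2 true positives. ROC AUC is computed from the probabilities themselves, which is why the posts pass pred[:,1] rather than hard labels.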

metrics

from sklearn.metrics import roc_auc_score, classification_report
print('test roc score : ', roc_auc_score(y_test, pred[:,1]))

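# Align the test dummies with the training columns in case a category is missing from test
test_x = pd.get_dummies(test.drop(columns = ['ID'])).reindex(columns = xd.columns, fill_value = 0)
test_pred = rf.predict_proba(test_x)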
submission['predict'] = test_pred[:,1]

print('submission file')
display(submission.head())
#submission.to_csv('dd.csv', index = False)


https://www.datamanim.com/dataset/03_dataq/typeone.html#id6


Question

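Take the group with quality 3 and the group with quality 8, compute the standard deviation of each independent variable within each group, and find the column name for which the difference between the two is largest.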

import pandas as pd
df = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/redwine/train.csv")
df.head()

answer = (df.loc[df.quality ==8].std() -df.loc[df.quality ==3].std()).sort_values().index[-1]
print(answer)


https://www.datamanim.com/dataset/03_dataq/typeone.html#id6


Question

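Excluding the Serial No. column, treat 'Chance of Admit' as the dependent variable and the remaining variables as independent variables. Fit a random forest regressor and print the feature importances (the order may vary with the seed).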

import pandas as pd
df = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/admission/train.csv")
df.head()

from sklearn.ensemble import RandomForestRegressor

df_t = df.drop([df.columns[0]],axis=1)
x = df_t.drop([df.columns[-1]],axis=1)
y = df_t[df.columns[-1]]

ml = RandomForestRegressor()

ml.fit(x,y)

result=pd.DataFrame({'importance':ml.feature_importances_},x.columns).sort_values('importance',ascending=False)
display(result)


https://www.datamanim.com/dataset/03_dataq/typeone.html#id6


Question

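Among the rows whose bedrooms value is the most frequent one, find the difference between the 90th percentile (top 10%) and the 10th percentile (bottom 10%) of price.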

import pandas as pd
df = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/kingcountyprice//train.csv")
df.head()

answer = df.loc[df.bedrooms ==df.bedrooms.value_counts().index[0]].price.quantile(0.9) \
-\
df.loc[df.bedrooms ==df.bedrooms.value_counts().index[0]].price.quantile(0.1)
print(answer)


https://www.datamanim.com/dataset/03_dataq/typeone.html#id6


Question

What is the difference between smokers and non-smokers in the mean charges of each group's top 10%?

import pandas as pd
train = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/MedicalCost/train.csv")
train.head()

high = train.loc[train.smoker =='yes'].charges.quantile(0.9)
high2 = train.loc[train.smoker =='no'].charges.quantile(0.9)
mean_yes = train.loc[(train.smoker =='yes') &(train.charges >=high)].charges.mean()
mean_no = train.loc[(train.smoker =='no') &(train.charges >=high2)].charges.mean()
answer = mean_yes - mean_no
print(answer)


https://www.datamanim.com/dataset/03_dataq/typeone.html#id6


Question

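The ph column contains quite a lot of missing values.

After excluding the missing values, what is the mean of the values in the bottom 25% (at or below the first quartile)?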

import pandas as pd
df = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/waters/train.csv")
df.head()

target = df['ph'].dropna()
answer =target.loc[target <= target.quantile(0.25)].mean()
print(answer)


hyerimir_archive