Posts in the '빅데이터분석기사' (Big Data Analysis Engineer) category (Page 3)

https://www.datamanim.com/dataset/03_dataq/typeone.html#id6

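Task Type 1 - DataManim

Question: Each video is known to have had its subscriber count, likes, dislikes, and comment count collected at 10-minute intervals. In the video data for 공범 EP1, the data intervals where the collection gap is 5 minutes or less, or 20 minutes or more (before and after that point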

www.datamanim.com

Question

Using the DateTime column, count how many rows fall in each month and return the result as a DataFrame.

import pandas as pd
df= pd.read_csv('https://raw.githubusercontent.com/Datamanim/datarepo/main/consum/Tetuan%20City%20power%20consumption.csv')
df.head()

df.info()

df['DateTime'] = pd.to_datetime(df['DateTime'])
result = df['DateTime'].dt.month.value_counts().sort_index().to_frame()
print(result)

Question

Among the mean temperatures for each hour of the day in March, print the lowest one.

target = df[df.DateTime.dt.month ==3]
result = target.groupby(target.DateTime.dt.hour)['Temperature'].mean().sort_values()
result

target = df[df.DateTime.dt.month ==3]
result = target.groupby(target.DateTime.dt.hour)['Temperature'].mean().sort_values().values[0]
print(result)

Question

Among the mean temperatures for each hour of the day in March, print the highest one.

target = df[df.DateTime.dt.month ==3]
result = target.groupby(target.DateTime.dt.hour)['Temperature'].mean().sort_values().values[-1]
print(result)
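The two answers above print only the extreme mean, not the hour it belongs to. As a minimal sketch on made-up readings (the timestamps and temperatures below are assumptions, not values from the Tetuan dataset), idxmin and idxmax return the hour label alongside the value:

```python
import pandas as pd

# Hypothetical toy data: readings at three hours on two March days
toy = pd.DataFrame({
    "DateTime": pd.to_datetime([
        "2017-03-01 00:00", "2017-03-01 12:00", "2017-03-01 18:00",
        "2017-03-02 00:00", "2017-03-02 12:00", "2017-03-02 18:00",
    ]),
    "Temperature": [10.0, 20.0, 15.0, 12.0, 22.0, 17.0],
})

# Mean temperature per hour of day
hourly_mean = toy.groupby(toy["DateTime"].dt.hour)["Temperature"].mean()

coldest_hour = hourly_mean.idxmin()  # index label (hour) of the smallest mean
hottest_hour = hourly_mean.idxmax()  # index label (hour) of the largest mean
print(coldest_hour, hourly_mean.min())
print(hottest_hour, hourly_mean.max())
```

This avoids sorting the whole Series when only the extremes are needed.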

Question

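Compute the mean Humidity of the rows where the Zone 1 Power Consumption value is larger than the Zone 2 Power Consumption value. (In the code below, the Zone 2 column name is written with two spaces: 'Zone 2  Power Consumption'.)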

result = df[df['Zone 1 Power Consumption'] > df['Zone 2  Power Consumption']].Humidity.mean()
print(result)

Question

Compute the correlations between the energy consumption of the zones and show them as a DataFrame.

result = df.iloc[:,-3:].corr()
display(result)

Question

Label Temperature values below 10 as A, from 10 up to (but not including) 20 as B, from 20 up to (but not including) 30 as C, and everything else as D, then count the rows in each group.

def split_data(x):
    if x<10:
        return "A"
    elif x<20:
        return 'B'
    elif x<30:
        return 'C'
    else:
        return 'D'
    
df['sp'] = df.Temperature.map(split_data)
result = df['sp'].value_counts()
display(result)
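The same binning can probably be expressed with pd.cut instead of a hand-written function. This is only a sketch on made-up temperatures (the values are assumptions chosen to hit every bin edge). One behavioral difference to keep in mind: pd.cut leaves missing values unlabeled, while split_data above would send NaN to 'D'.

```python
import numpy as np
import pandas as pd

# Hypothetical temperatures chosen to sit on and around each bin edge
temps = pd.Series([-3.0, 9.9, 10.0, 19.9, 20.0, 29.9, 30.0, 41.5])

# right=False closes each bin on the left: [-inf, 10), [10, 20), [20, 30), [30, inf)
grades = pd.cut(
    temps,
    bins=[-np.inf, 10, 20, 30, np.inf],
    labels=["A", "B", "C", "D"],
    right=False,
)

# Count rows per label, ordered A to D
counts = grades.value_counts().sort_index()
print(counts)
```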

Question

For the June data, compute the standard deviation of Temperature at 12 o'clock.

result =df[(df.DateTime.dt.month ==6) & (df.DateTime.dt.hour ==12)].Temperature.std()
print(result)

Question

For the June data, compute the variance of Temperature at 12 o'clock.

result =df[(df.DateTime.dt.month ==6) & (df.DateTime.dt.hour ==12)].Temperature.var()
print(result)

Question

Take the rows whose Temperature is at or above the mean Temperature and sort them by Temperature. What is the Humidity value in the 4th row?

result = df[df.Temperature >= df.Temperature.mean()].sort_values('Temperature').Humidity.values[3]
print(result)

Question

Take the rows whose Temperature is at or above the median Temperature and sort them by Temperature. What is the Humidity value in the 4th row?

result = df[df.Temperature >= df.Temperature.median()].sort_values('Temperature').Humidity.values[3]
print(result)

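Other posts in the '빅데이터분석기사 > 작업 1유형' (Big Data Analysis Engineer > Task Type 1) category

Korea physical fitness test data (0)	2022.08.15
Pokémon information data (0)	2022.08.15
World happiness index data (0)	2022.08.15
Seoul Ttareungi bike usage data (0)	2022.08.15
World Cup player goal records data (0)	2022.08.14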

https://www.datamanim.com/dataset/03_dataq/typeone.html#id6


Question

The data describes world happiness scores for 2018 and 2019.

Compute the mean happiness score (점수) of the countries ranked 10th (행복랭킹) in each year.

import pandas as pd
df =pd.read_csv('https://raw.githubusercontent.com/Datamanim/datarepo/main/happy2/happiness.csv',encoding='utf-8')
df.head()

df.shape

result = df[df.행복랭킹 ==10]['점수'].mean()
print(result)

Question

The data describes world happiness scores for 2018 and 2019.

For each year, compute the mean happiness score of the countries ranked in the top 50 and show the result as a DataFrame.

result = df[df.행복랭킹<=50][['년도','점수']].groupby('년도').mean()
print(result)

Question

Extract only the 2018 data and compute the correlation coefficient between the happiness score (점수) and perceptions of corruption (부패에 대한인식).

df[df.년도 ==2018][['점수','부패에 대한인식']].corr()

result = df[df.년도 ==2018][['점수','부패에 대한인식']].corr().iloc[0,1]
print(result)

Question

Count the countries (나라명) whose happiness ranking did not change between 2018 and 2019.

df[['행복랭킹','나라명']].drop_duplicates()

result = len(df[['행복랭킹','나라명']]) - len(df[['행복랭킹','나라명']].drop_duplicates())
print(result)

Question

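Extract only the 2019 data, compute the correlation coefficients between every pair of variables, sort them in descending order, and output the top 5 as a DataFrame. Name the columns v1, v2, and corr.

# numeric_only=True (pandas >= 1.5) skips text columns such as 나라명; pandas 2.x raises an error without it
df[df.년도 ==2019].corr(numeric_only=True)

df[df.년도 ==2019].corr(numeric_only=True).unstack()

df[df.년도 ==2019].corr(numeric_only=True).unstack().to_frame()

df[df.년도 ==2019].corr(numeric_only=True).unstack().to_frame().reset_index()

zz = df[df.년도 ==2019].corr(numeric_only=True).unstack().to_frame().reset_index().dropna()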

result = zz[zz[0] !=1].sort_values(0,ascending=False).drop_duplicates(0)
answer = result.head(5).reset_index(drop=True)
answer.columns = ['v1','v2','corr']
display(answer)
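The answer above removes mirrored pairs by dropping duplicate correlation values, which would also drop two different pairs that happen to share the same value. One possible alternative, sketched here on made-up numeric columns (the frame and its column names a, b, c are assumptions, not the happiness data), is to keep only the cells above the diagonal of the correlation matrix:

```python
import numpy as np
import pandas as pd

# Hypothetical numeric columns standing in for the real variables
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 11.0],
    "c": [5.0, 3.0, 4.0, 1.0, 2.0],
})

corr = toy.corr()

# Boolean mask that is True strictly above the diagonal,
# so each unordered pair appears once and self-correlations are excluded
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)

pairs = corr.where(upper).unstack().dropna().reset_index()
pairs.columns = ["v1", "v2", "corr"]

top = pairs.sort_values("corr", ascending=False).reset_index(drop=True)
print(top)
```

Calling top.head(5) would then give the five strongest pairs.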

Question

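For each year, compute the mean happiness score (점수) of the five lowest-scoring countries.

# tail(5) assumes each year's rows are already ordered from the highest to the lowest score
df.groupby('년도').tail(5)

df.groupby('년도').tail(5).groupby('년도').mean(numeric_only=True)

# Select 점수 before averaging so text columns are never aggregated
result = df.groupby('년도').tail(5).groupby('년도')[['점수']].mean()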
print(result)

Question

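Extract the 2019 data, then compute the mean happiness score (점수) of the countries at or above the mean relative GDP (상대GDP) and of the countries at or below it, and print the difference.

# Filter to 2019 first, as the question asks
df_2019 = df[df.년도 == 2019]
gdp_mean = df_2019.상대GDP.mean()
over = df_2019[df_2019.상대GDP >= gdp_mean]['점수'].mean()
under = df_2019[df_2019.상대GDP <= gdp_mean]['점수'].mean()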

result= over - under
print(result)

Question

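For each year, sort perceptions of corruption (부패에 대한인식) in descending order and compute the mean of that column for the top 20 countries.

df.sort_values(['년도','부패에 대한인식'],ascending=False)

df.sort_values(['년도','부패에 대한인식'],ascending=False).groupby('년도').head(20).groupby(['년도']).mean(numeric_only=True)

# Select the column before averaging so text columns are never aggregated
result = df.sort_values(['년도','부패에 대한인식'],ascending=False).groupby('년도').head(20).groupby(['년도'])[['부패에 대한인식']].mean()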
print(result)

Question

Count the countries that were in the top 50 of the happiness ranking in 2018 but fell outside the top 50 in 2019.

result = set(df[(df.년도==2018) & (df.행복랭킹 <=50)].나라명)  -set(df[(df.년도==2019) & (df.행복랭킹 <=50)].나라명)
answer = len(result)
print(answer)

Question

Among the countries with records for both 2018 and 2019, which country's happiness score increased the most from one year to the next, and by how much?

count = df.나라명.value_counts()
count

type(count)

target = count[count>=2].index
target

count = df.나라명.value_counts()
target = count[count>=2].index

df2 =df.copy()
multiple = df2[df2.나라명.isin(target)].reset_index(drop=True)
multiple.loc[multiple['년도']==2018,'점수'] = multiple[multiple.년도 ==2018]['점수'].values * (-1)
result = multiple.groupby('나라명').sum()['점수'].sort_values().to_frame().iloc[-1]
result
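The sign-flip-and-sum trick above works, but a pivot may be easier to read. The following is a sketch on made-up rows (the English column names country, year, score and all values are assumptions standing in for 나라명, 년도, 점수): each year becomes a column, countries missing a year drop out, and the change is a plain subtraction.

```python
import pandas as pd

# Hypothetical rows; "Gamma" has only one year and should be ignored
toy = pd.DataFrame({
    "country": ["Alpha", "Beta", "Gamma", "Alpha", "Beta"],
    "year": [2018, 2018, 2018, 2019, 2019],
    "score": [5.0, 6.5, 4.0, 5.4, 6.2],
})

# One row per country, one column per year; dropna removes countries without both years
wide = toy.pivot(index="country", columns="year", values="score").dropna()

change = wide[2019] - wide[2018]
best_country = change.idxmax()  # country with the largest increase
best_gain = change.max()        # size of that increase
print(best_country, round(best_gain, 3))
```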


https://www.datamanim.com/dataset/03_dataq/typeone.html#id6


Question

Output the number of rows for each rental date (대여일자) as a DataFrame, and print the date with the most rows.

import pandas as pd
df =pd.read_csv('https://raw.githubusercontent.com/Datamanim/datarepo/main/bicycle/seoul_bi.csv')
df.head()

df['대여일자'].value_counts().sort_index()

result = df['대여일자'].value_counts().sort_index().to_frame()
result

counts = df['대여일자'].value_counts().sort_index()
result = counts.to_frame()
# idxmax() returns the date with the most rows, whatever name to_frame() gives the count column
answer = counts.idxmax()

display(result)
print(answer)

Question

Add a 'day_name' column with the weekday name of each date ('Monday' to 'Sunday'), then use it to output the total number of uses for each weekday as a DataFrame.

df['대여일자'] = pd.to_datetime(df['대여일자'])
df['day_name']  = df['대여일자'].dt.day_name()

result =  df.day_name.value_counts().to_frame()
print(result)

Question

For each weekday, output the most-used rental station's number (대여소번호) and its usage count as a DataFrame.

df.groupby(['day_name','대여소번호']).size()

df.groupby(['day_name','대여소번호']).size().to_frame('size')

result = df.groupby(['day_name','대여소번호']).size().to_frame('size').sort_values(['day_name','size'],ascending=False).reset_index()
answer  = result.drop_duplicates('day_name',keep='first').reset_index(drop=True)
display(answer)

Question

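For each age group (연령대코드), compute the ratio of day passes to all rentals in the rental type code (대여구분코드), then find the age group with the highest ratio.

Count both '일일권' (day pass) and '일일권(비회원)' (day pass, non-member) as day passes.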

daily = df[df.대여구분코드.isin(['일일권','일일권(비회원)'])].연령대코드.value_counts().sort_index()
total = df.연령대코드.value_counts().sort_index()

ratio = (daily /total).sort_values(ascending=False)
print(ratio)
print('max ratio age ',ratio.index[0])

Question

Compute the mean travel distance (이동거리) for each age group.

result = df[['연령대코드','이동거리']].groupby(['연령대코드']).mean()
print(result)

Question

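Extract the rows whose age group code is '20대' (20s), then keep the rows whose travel distance (이동거리) is at or above the mean travel distance of that extract. Sort the result in descending order by rental date (대여일자) and then rental station number (대여소번호), and compute the mean carbon amount (탄소량) of rows 1 through 200, rounded to three decimal places.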

tw = df[df.연령대코드 =='20대'].reset_index(drop=True)
tw_mean = tw[tw.이동거리 >= tw.이동거리.mean()].reset_index(drop=True)
tw_mean['탄소량'] =tw_mean['탄소량'].astype('float')
target =tw_mean.sort_values(['대여일자','대여소번호'], ascending=False).reset_index(drop=True).iloc[:200].탄소량
result = round(target.sum()/len(target),3)
print(result)

Question

What is the median usage count (이용건수) for the '~10대' (teens and under) age group on June 7?

df['대여일자']  =pd.to_datetime(df['대여일자'])
result = df[(df.연령대코드 =='~10대') & (df.대여일자 ==pd.to_datetime('2021-06-07'))].이용건수.median()
print(result)

Question

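Count uses per rental station during weekday (Monday to Friday) morning commute hours (6, 7, and 8 a.m.) as a DataFrame, then output the top 3 stations and their usage counts for each rental hour (대여시간).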

target = df[(df.day_name.isin(['Tuesday', 'Wednesday', 'Thursday', 'Friday','Monday'])) & (df.대여시간.isin([6,7,8]))]
result = target.groupby(['대여시간','대여소번호']).size().to_frame('이용 횟수')

answer = result.sort_values(['대여시간','이용 횟수'],ascending=False).groupby('대여시간').head(3)
display(answer)

Question

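Extract the rows whose travel distance (이동거리) is at or above the mean travel distance, and compute the sample standard deviation of the travel distance in that extract.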

result  = df[df.이동거리 >= df.이동거리.mean()].reset_index(drop=True).이동거리.std()
print(result)
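The question asks for the sample standard deviation, which is what Series.std() returns by default (ddof=1). A small sketch on made-up numbers (the values are assumptions) shows how it differs from the population version:

```python
import pandas as pd

# Hypothetical values with mean 5 and squared deviations summing to 32
s = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

sample_std = s.std()            # ddof=1: divides by n - 1 (pandas default)
population_std = s.std(ddof=0)  # ddof=0: divides by n
print(sample_std, population_std)
```

NumPy's np.std defaults to ddof=0, so the two libraries disagree unless ddof is set explicitly.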

Question

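Compute the mean travel distance (이동거리) for men ('M' or 'm') and for women ('F' or 'f').

# Normalize case, then map M to '남' (male) and F to '여' (female); any other value becomes NaN and is left out of the groupby
df['sex'] = df['성별'].str.upper().map({'M': '남', 'F': '여'})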
answer = df[['sex','이동거리']].groupby('sex').mean()
display(answer)


https://www.datamanim.com/dataset/03_dataq/typeone.html


Question

Output the top 5 countries by total goals scored over the whole period, with their goal totals, as a DataFrame.

import pandas as pd

df= pd.read_csv('https://raw.githubusercontent.com/Datamanim/datarepo/main/worldcup/worldcupgoals.csv')
df.head()

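# Select Goals before summing so text columns (Player, Years) are not aggregated
df.groupby('Country')[['Goals']].sum()

result = df.groupby('Country')[['Goals']].sum().sort_values('Goals',ascending=False).head(5)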
display(result)

Question

Output the top 5 countries with the most goal-scoring players over the whole period, with the number of players, as a DataFrame.

df.groupby('Country').size()

result = df.groupby('Country').size().sort_values(ascending=False).head(5)
print(result)

Question

The Years column has the form year-year, where each year is a 4-digit number.

Some entries do not write the year with 4 digits. Print how many such rows there are.

df['yearLst'] = df.Years.str.split('-')

def checkFour(x):
    for value in x:
        if len(str(value)) != 4:
            return False
        
    return True
    
df['check'] = df['yearLst'].apply(checkFour)

result = len(df[df.check ==False])
result
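The check above only tests the length of each piece. A stricter variant, sketched here on made-up strings (the sample values are assumptions, not rows from the World Cup file), uses a regular expression that also requires every piece to be digits:

```python
import pandas as pd

# Hypothetical Years strings; the last two are malformed on purpose
years = pd.Series(["1930", "1998-2002", "1962-1966-1970", "1978-82", "19982002"])

# One or more 4-digit groups separated by hyphens, and nothing else
is_valid = years.str.fullmatch(r"\d{4}(-\d{4})*")

n_invalid = int((~is_valid).sum())
print(n_invalid)
```

On the real data the two checks may give different counts if any 4-character piece contains non-digit characters.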

Question

Define df2 as the DataFrame that excludes the exceptional cases found in Q3 (the previous question), and print its number of rows.

(The questions below are solved with df2.)

df2 = df[df.check ==True].reset_index(drop=True)
print(df2.shape[0])

Question

Add a 'LenCup' column holding the number of World Cup appearances, and count the players who appeared 4 times.

df2['LenCup'] =df2['yearLst'].str.len()
df2

df2['LenCup'].value_counts()

df2['LenCup'] =df2['yearLst'].str.len()
result = df2['LenCup'].value_counts()[4]
print(result)

Question

Count the players from Yugoslavia with exactly 2 World Cup appearances.

result = len(df2[(df2.LenCup==2) & (df2.Country =='Yugoslavia')])
print(result)

Question

How many players in total appeared in 2002?

result =len(df2[df2.Years.str.contains('2002')])
print(result)

Question

How many players have the word 'carlos' in their name? (Case-insensitive)

result = len(df2[df2.Player.str.lower().str.contains('carlos')])
print(result)

Question

Among players with only one World Cup appearance, who scored the most goals?

df2[df2.LenCup==1].sort_values('Goals',ascending=False)

df2[df2.LenCup==1].sort_values('Goals',ascending=False).Player

result = df2[df2.LenCup==1].sort_values('Goals',ascending=False).Player.values[0]
print(result)

pandas.DataFrame.values

: return a Numpy representation of the DataFrame

Question

Which country has the most players with only one World Cup appearance?

df2[df2.LenCup==1].Country.value_counts()

result= df2[df2.LenCup==1].Country.value_counts().index[0]
print(result)


https://www.datamanim.com/dataset/03_dataq/typeone.html


import pandas as pd
channel =pd.read_csv('https://raw.githubusercontent.com/Datamanim/datarepo/main/youtube/channelInfo.csv')
video =pd.read_csv('https://raw.githubusercontent.com/Datamanim/datarepo/main/youtube/videoInfo.csv')
display(channel.head())
display(video.head())

Question

Convert the 'ct' column of each dataset to a datetime dtype, then check how many rows the video data has for each videoname value.

video['ct'] = pd.to_datetime(video['ct'])
channel['ct'] = pd.to_datetime(channel['ct'])
answer = video.videoname.value_counts()
print(answer)

Question

Output the viewcount value from the most recent timestamp of each collected video.

answer = video.sort_values(['videoname','ct']).drop_duplicates('videoname',keep='last')[['viewcnt','videoname','ct']].reset_index(drop=True)
display(answer)

drop_duplicates('videoname', keep='last') is used here to keep the row with the most recent timestamp for each video.

* The default value of keep is 'first'.
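To make the effect of keep concrete, here is a small sketch on made-up snapshots (the frame below is an assumption, not the real video data). It also shows a groupby and idxmax route that should select the same rows without sorting first:

```python
import pandas as pd

# Hypothetical snapshots for two videos
toy = pd.DataFrame({
    "videoname": ["ep1", "ep2", "ep1", "ep2"],
    "ct": pd.to_datetime(["2021-10-01 10:00", "2021-10-01 10:00",
                          "2021-10-02 10:00", "2021-10-03 10:00"]),
    "viewcnt": [100, 50, 180, 90],
})

ordered = toy.sort_values(["videoname", "ct"])
first_rows = ordered.drop_duplicates("videoname")              # keep='first' (default): oldest snapshot
last_rows = ordered.drop_duplicates("videoname", keep="last")  # newest snapshot

# Alternative: pick the row label of the maximum ct within each video
via_idxmax = toy.loc[toy.groupby("videoname")["ct"].idxmax()]
print(last_rows)
```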

Question

In the channel data, output the first recorded subscriber count (subcnt) of each channel on or after 2021-10-03.

channel.ct = pd.to_datetime(channel.ct)
target = channel[channel.ct >= pd.to_datetime('2021-10-03')].sort_values(['ct','channelname']).drop_duplicates('channelname')
answer = target[['channelname','subcnt']].reset_index(drop=True)
print(answer)

Question

Compute the increase in each channel's subscriber count (subcnt) from 2021-10-03 03:00:00 to 2021-11-01 15:00:00.

end = channel.loc[channel.ct.dt.strftime('%Y-%m-%d %H') =='2021-11-01 15']
start = channel.loc[channel.ct.dt.strftime('%Y-%m-%d %H') =='2021-10-03 03']

end_df = end[['channelname','subcnt']].reset_index(drop=True)
start_df = start[['channelname','subcnt']].reset_index(drop=True)

end_df.columns = ['channelname','end_sub']
start_df.columns = ['channelname','start_sub']


tt = pd.merge(start_df,end_df)
tt['del'] = tt['end_sub'] - tt['start_sub']
result = tt[['channelname','del']]
display(result)

Question

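Each video is known to have had its subscriber count, likes, dislikes, and comment count collected at 10-minute intervals.

In the video data for 공범 EP1, output all timestamps of the intervals (the rows before and after the point in question) where the collection gap is 5 minutes or less, or 20 minutes or more.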

import datetime

ep_one = video.loc[video.videoname.str.contains('1')].sort_values('ct').reset_index(drop=True)

ep_one[
        (ep_one.ct.diff(1) >=datetime.timedelta(minutes=20)) | \
        (ep_one.ct.diff(1) <=datetime.timedelta(minutes=5))
      
      ]

answer = ep_one[ep_one.index.isin([720,721,722,723,1635,1636,1637])]
display(answer)
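The row numbers in the answer above were read off the filtered output by hand. As a sketch of how the same selection might be derived in code (the timestamps below are made up, not taken from the video data), each flagged row can be combined with the row just before it by shifting the mask:

```python
import pandas as pd

# Hypothetical collection times: one 3-minute gap and one 30-minute gap
ct = pd.Series(pd.to_datetime([
    "2021-10-01 19:00", "2021-10-01 19:10", "2021-10-01 19:13",
    "2021-10-01 19:23", "2021-10-01 19:53", "2021-10-01 20:03",
]))

gap = ct.diff()  # time since the previous row; NaT for the first row
irregular = (gap <= pd.Timedelta(minutes=5)) | (gap >= pd.Timedelta(minutes=20))

# A flagged row ends an irregular interval; shifting the mask up by one
# also marks the row that starts it
starts = irregular.shift(-1, fill_value=False).astype(bool)
with_previous = irregular | starts

positions = ct[with_previous].index.tolist()
print(positions)
```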

Question

Build and output a DataFrame that pairs each episode's start date (year-month-day) with the episode name.

start_date = video.sort_values(['ct','videoname']).drop_duplicates('videoname')[['ct','videoname']]
start_date['date'] = start_date.ct.dt.date
answer = start_date[['date','videoname']]
display(answer)

Question

The "공범" content is known to be released at 19:00.

Output a DataFrame of viewcnt, ct, and videoname at 21:00 on each release day, sorted by viewcnt in descending order.

video['time']= video.ct.dt.hour

answer = video.loc[video['time'] ==21] \
            .sort_values(['videoname','ct'])\
            .drop_duplicates('videoname') \
            .sort_values('viewcnt',ascending=False)[['videoname','viewcnt','ct']]\
            .reset_index(drop=True)

display(answer)

Question

From the most recent video data, create a ratio column with each episode's dislike/like ratio, and sort the DataFrame of videoname and ratio by ratio in ascending order.

target = video.sort_values('ct').drop_duplicates('videoname',keep='last')
target['ratio'] =target['dislikecnt'] / target['likecnt']

answer = target.sort_values('ratio')[['videoname','ratio']].reset_index(drop=True)
answer

Question

Build a DataFrame of the increase in viewcnt for each episode from 2021-11-01 00:00:00 to 15:00:00.

start = pd.to_datetime("2021-11-01 00:00:00")
end = pd.to_datetime("2021-11-01 15:00:00")

target = video.loc[(video["ct"] >= start) & (video['ct'] <= end)].reset_index(drop=True)
target

start = pd.to_datetime("2021-11-01 00:00:00")
end = pd.to_datetime("2021-11-01 15:00:00")

target = video.loc[(video["ct"] >= start) & (video['ct'] <= end)].reset_index(drop=True)

def check(x):
    result = max(x) - min(x)
    return result

answer = target[['videoname','viewcnt']].groupby("videoname").agg(check)
answer

Question

The video data contains duplicated rows.

Find the timestamp (ct) and videoname of each duplicated row.

video.drop_duplicates().index

answer  = video[video.index.isin(set(video.index) -  set(video.drop_duplicates().index))]
result = answer[['videoname','ct']]
display(result)
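The index set difference above can likely be replaced with DataFrame.duplicated(). Here is a sketch on a made-up frame (the rows are assumptions) that checks the two approaches select the same rows:

```python
import pandas as pd

# Hypothetical rows; index 2 repeats index 1 exactly
toy = pd.DataFrame({
    "videoname": ["ep1", "ep2", "ep2", "ep3"],
    "ct": pd.to_datetime(["2021-10-01", "2021-10-02", "2021-10-02", "2021-10-03"]),
    "viewcnt": [10, 20, 20, 30],
})

# duplicated() marks every repeat after the first occurrence
repeats = toy[toy.duplicated()]

# Same selection written as an index set difference
by_set = toy[toy.index.isin(set(toy.index) - set(toy.drop_duplicates().index))]

print(repeats[["videoname", "ct"]])
```

Passing keep=False to duplicated() would instead mark every copy, including the first.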


https://www.datamanim.com/dataset/03_dataq/typeone.html


Question

Output the names of the top 10 channels with the most trending videos (counted per date, duplicates included).

import pandas as pd
df = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/youtube/youtube.csv",index_col=0)
df.head()

# value_counts() returns counts in descending order
df.channelId.value_counts()

df.channelId.value_counts().head(10).index

answer =list(df.loc[df.channelId.isin(df.channelId.value_counts().head(10).index)].channelTitle.unique())
print(answer)

Question

We want to find videos that trended because of controversy.

Output every channel that made a video with more dislikes than likes.

answer =list(df.loc[df.likes < df.dislikes].channelTitle.unique())
print(answer)

Question

We want to check whether any channel changed its name.

channelId is unique per channel, so use it to count the channels that changed their name at least once.

change = df[['channelTitle','channelId']].drop_duplicates().channelId.value_counts()
change

change = df[['channelTitle','channelId']].drop_duplicates().channelId.value_counts()
target = change[change>1]
print(len(target))

Question

Among videos that were trending on a Sunday, which video category (categoryId) is the most common?

df.info()

type(df['trending_date2'][0])

df['trending_date2'] = pd.to_datetime(df['trending_date2'])
answer =df.loc[df['trending_date2'].dt.day_name() =='Sunday'].categoryId.value_counts().index[0]
print(answer)

Question

Show in a single DataFrame how many trending videos each categoryId has on each day of the week.

group = df.groupby([df['trending_date2'].dt.day_name(),'categoryId'],as_index=False).size()
group

group = df.groupby([df['trending_date2'].dt.day_name(),'categoryId'],as_index=False).size()
answer= group.pivot(index='categoryId',columns='trending_date2')
display(answer)
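A count table like this can also be built in one step with pd.crosstab. The sketch below uses made-up rows (the dates and category ids are assumptions, not the trending data):

```python
import pandas as pd

# Hypothetical trending rows: 2021-10-03 and 2021-10-10 are Sundays, 2021-10-04 is a Monday
toy = pd.DataFrame({
    "trending_date2": pd.to_datetime(["2021-10-03", "2021-10-03", "2021-10-04",
                                      "2021-10-04", "2021-10-10"]),
    "categoryId": [10, 24, 10, 10, 24],
})

# Rows are categories, columns are weekday names, cells are counts (0 when absent)
table = pd.crosstab(toy["categoryId"], toy["trending_date2"].dt.day_name())
print(table)
```

Unlike the pivot of group sizes, crosstab fills missing combinations with 0 rather than NaN.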

Question

The number of comments (comment_count) can be used to gauge the reaction to a video.

Find the video with the highest number of comments relative to view count (exclude rows where view_count is 0).

target2= df.loc[df.view_count!=0]
t = target2.copy()
t['ratio'] = (target2['comment_count']/target2['view_count']).dropna()

t.sort_values(by='ratio', ascending=False)

target2= df.loc[df.view_count!=0]
t = target2.copy()
t['ratio'] = (target2['comment_count']/target2['view_count']).dropna()
result = t.sort_values(by='ratio', ascending=False).iloc[0].title
print(result)

Question

The number of comments (comment_count) can be used to gauge the reaction to a video.

Find the video with the lowest number of comments relative to view count (exclude rows where view_count or the ratio is 0).

ratio = (df['comment_count'] / df['view_count']).dropna().sort_values()
ratio

ratio = (df['comment_count'] / df['view_count']).dropna().sort_values()
ratio[ratio!=0].index[0]

# The value taken from ratio.index is an index label, so look it up with loc
result= df.loc[ratio[ratio!=0].index[0]].title
print(result)

Question

Which video has the fewest dislikes relative to likes? (Exclude rows where likes or dislikes is 0.)

target = df.loc[(df.likes !=0) & (df.dislikes !=0)]
num = (target['dislikes']/target['likes']).sort_values().index[0]
num

target = df.loc[(df.likes !=0) & (df.dislikes !=0)]
num = (target['dislikes']/target['likes']).sort_values().index[0]

# num is an index label taken from target, so look it up with loc
answer = df.loc[num].title
print(answer)
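A value read from .index is a label, not a row position, and the two only coincide when the index happens to be 0, 1, 2, and so on. A small sketch on a made-up frame (the titles, ratios, and index labels are assumptions) shows the difference:

```python
import pandas as pd

# Hypothetical frame whose index labels do not match row positions
toy = pd.DataFrame(
    {"title": ["a", "b", "c"], "ratio": [0.3, 0.1, 0.2]},
    index=[10, 20, 30],
)

label = toy["ratio"].sort_values().index[0]  # label of the smallest ratio
by_label = toy.loc[label, "title"]           # label-based lookup

# toy.iloc[label] would raise IndexError here, because 20 is not a valid position.
# A position-based lookup needs a position, for example from argmin:
position = toy["ratio"].to_numpy().argmin()
by_position = toy.iloc[position]["title"]
print(by_label, by_position)
```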

Question

What is the name of the channel that made the most trending videos? (Counted per date, duplicates included)

df.loc[df.channelId ==df.channelId.value_counts().index[0]].channelTitle.unique()

answer = df.loc[df.channelId ==df.channelId.value_counts().index[0]].channelTitle.unique()[0]
print(answer)

Question

How many videos appeared on the trending list 20 or more times (20 days)?

df[['title','channelId']]

df[['title','channelId']].value_counts()

answer= (df[['title','channelId']].value_counts()>=20).sum()
print(answer)


hyerimir_archive