YY Censorship Research

“【自我说明】包含敏感字符，请重新输入。” ([Profile] contains sensitive characters, please try again.) Note that for censorship of text chat, triggering messages do not display this warning.

Check out the corresponding paper and slides.

Censorship Analysis

YY 7.1 downloads three different keyword lists:

Finance Keyword List (48 words) Source: http://do.yy.duowan.com/financekwordlist
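These are keywords related to phishing scams. When a message containing one of them is received, YY prints the following warning in the chat window:
YY安全提示：聊天中若有涉及财产的操作，请一定要先核实好友身份，谨防受骗！ (YY security tip: if a chat involves any operation concerning property, be sure to verify your friend's identity first and beware of being scammed!)
The keywords are downloaded in plain text as UTF-8-encoded XML.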
Decoded "Normal" Keyword List (22 words) Source: http://do.yy.duowan.com/NormalKWordlist.txt
These are sensitive keywords that are replaced with asterisks in the message before it is sent and that trigger a surveillance message back to YY's servers. They are downloaded as a base64-encoded list of UTF-16-encoded keywords, each separated by a carriage return followed by a line feed.
Decoded "High" Keyword List (13,461 words) Source: http://do.yy.duowan.com/HighKWordlist.txt
These are sensitive keywords that prevent the containing message from ever being sent and that trigger a surveillance message back to YY's servers. If a message containing one of these keywords is somehow received, it is shown in the chat window as a blank message. They are downloaded as a base64-encoded list of UTF-16-encoded keywords, each separated by a carriage return followed by a line feed.
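
The sending-side behavior of the "Normal" and "High" lists described above can be sketched roughly as follows. This is an illustration only, not code recovered from the client: the function name apply_keyword_lists is hypothetical, and plain case-sensitive substring matching with one asterisk per character is an assumption, since the exact matching and masking rules are not described here.

```python
def apply_keyword_lists(message, normal, high):
    """Return (outgoing_text_or_None, triggering_keywords).

    Assumption: keywords are matched as plain, case-sensitive substrings.
    """
    # "High" keywords: the message is never sent.
    hits = [kw for kw in high if kw in message]
    if hits:
        return None, hits

    # "Normal" keywords: the message is sent with the keywords masked.
    hits = [kw for kw in normal if kw in message]
    masked = message
    for kw in hits:
        # Assumption: one asterisk per keyword character.
        masked = masked.replace(kw, "*" * len(kw))
    return masked, hits
```

In both cases a non-empty list of triggering keywords corresponds to the situation in which a surveillance message would be reported.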

Surveillance Analysis

When a user sends a message containing a keyword from the "Normal" or "High" lists above, the client sends a surveillance message via an HTTP GET request to a URL of the form:

http://sere.hiido.com/do.action?id=<id>&content=<content>

<id> is a hash computed as md5(⌊<seconds since Unix epoch> / 1000⌋ + ";username=report;password=pswd@1234"), hex-encoded. The username and password in the hashed string are hardcoded; they are not the username and password of the sender or receiver of the triggering message.
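
The <id> computation above can be sketched in Python. The function name report_id is hypothetical, and the sketch assumes that the floored value is rendered as a decimal string and concatenated with the hardcoded suffix as ASCII; the exact formatting used by the client is not stated here.

```python
import hashlib
import time

# Hardcoded suffix from the description above.
REPORT_SUFFIX = ";username=report;password=pswd@1234"


def report_id(epoch_seconds=None):
    """Hex MD5 of floor(seconds / 1000) followed by the hardcoded suffix."""
    if epoch_seconds is None:
        epoch_seconds = time.time()
    # Assumption: the floored value is formatted as a plain decimal string.
    bucket = int(epoch_seconds) // 1000
    return hashlib.md5((str(bucket) + REPORT_SUFFIX).encode("ascii")).hexdigest()
```

Because of the division by 1000, the same <id> is produced for every timestamp within a 1000-second window, so the value depends only on the coarse time and not on the sender, receiver, or message.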

<content> is a base64-encoded string of the following form:

type=2;uid=<sending user id #>;touid=<receiving user id #>;keyword=<triggering keyword>;txt=<triggering message in its entirety>

In the version of YY analyzed, type is hardcoded to 2.
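
As an illustration of the <content> format, the following sketch builds and parses such a string. The function names build_content and parse_content are hypothetical, and UTF-8 as the text encoding inside the base64 payload is an assumption, since the encoding is not stated here.

```python
import base64


def build_content(uid, touid, keyword, txt, encoding="utf-8"):
    """Build the base64 <content> value; type is fixed to 2 as described."""
    plain = f"type=2;uid={uid};touid={touid};keyword={keyword};txt={txt}"
    return base64.b64encode(plain.encode(encoding)).decode("ascii")


def parse_content(content, encoding="utf-8"):
    """Decode a <content> value back into a dict of its fields."""
    plain = base64.b64decode(content).decode(encoding)
    # txt is the last field and may itself contain ';', so split it off first.
    head, sep, txt = plain.partition(";txt=")
    fields = {}
    for pair in head.split(";"):
        key, _, value = pair.partition("=")
        fields[key] = value
    if sep:
        fields["txt"] = txt
    return fields
```

For example, parse_content(build_content(1001, 2002, "kw", "some kw text")) recovers the sender ID, receiver ID, triggering keyword, and the full message text, which is exactly the information carried by each report.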

Code

decode.py is a Python script that automates decoding the "Normal" and "High" lists into plain-text UTF-8.
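
For readers who want to see the decoding step itself, here is a minimal sketch of what it involves. It is not the contents of decode.py: the function names are hypothetical, and little-endian UTF-16 is an assumption (the byte order is not stated above), which is why the encoding is left as a parameter.

```python
import base64


def decode_keyword_list(blob, encoding="utf-16-le"):
    """Decode a base64 blob of CRLF-separated UTF-16 keywords into a list."""
    text = base64.b64decode(blob).decode(encoding)
    # Drop a leading byte-order mark if the list happens to carry one.
    text = text.lstrip("\ufeff")
    # Keywords are separated by carriage return + line feed.
    return [kw for kw in text.split("\r\n") if kw]


def to_utf8(keywords):
    """Serialize keywords as plain-text UTF-8, one per line."""
    return "\n".join(keywords).encode("utf-8")
```

A downloaded list would be passed in as the raw response body, for example to_utf8(decode_keyword_list(body)), with the encoding argument adjusted if the output looks garbled.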