
How to remove hashtags, @user mentions, and links from a tweet using regular expressions

I need to preprocess tweets using Python. What regular expressions would remove all hashtags, @user mentions, and links from a tweet, respectively?

For example:

  1. original tweet: @peter I really love that shirt at #Macy. http://bet.ly//WjdiW4
    • processed tweet: I really love that shirt at Macy
  2. original tweet: @shawn Titanic tragedy could have been prevented Economic Times: Telegraph.co.ukTitanic tragedy could have been preve... http://bet.ly/tuN2wx
    • processed tweet: Titanic tragedy could have been prevented Economic Times Telegraph co ukTitanic tragedy could have been preve
  3. original tweet: I am at Starbucks http://4sh.com/samqUI (7419 3rd ave, at 75th, Brooklyn)
    • processed tweet: I am at Starbucks 7419 3rd ave at 75th Brooklyn

I only need the meaningful words in each tweet. I don't need the username, any links, or any punctuation.

14
Peiti Li

A bit late, but this solution avoids punctuation errors such as #Hashtag1,#Hashtag2 (with no space between them), and the implementation is very simple:

import re
import string

def strip_links(text):
    # Raw string so the backslashes reach the regex engine unchanged.
    link_regex    = re.compile(r'((https?):((//)|(\\))+([\w\d:#@%/;$()~_?\+-=\\\.&](#!)?)*)', re.DOTALL)
    links         = re.findall(link_regex, text)
    for link in links:
        text = text.replace(link[0], ', ')    
    return text

def strip_all_entities(text):
    entity_prefixes = ['@','#']
    for separator in  string.punctuation:
        if separator not in entity_prefixes :
            text = text.replace(separator,' ')
    words = []
    for word in text.split():
        word = word.strip()
        if word:
            # Drop whole tokens that begin with '@' or '#'.
            if word[0] not in entity_prefixes:
                words.append(word)
    return ' '.join(words)


tests = [
    "@peter I really love that shirt at #Macy. http://bet.ly//WjdiW4",
    "@shawn Titanic tragedy could have been prevented Economic Times: Telegraph.co.ukTitanic tragedy could have been preve... http://bet.ly/tuN2wx",
    "I am at Starbucks http://4sh.com/samqUI (7419 3rd ave, at 75th, Brooklyn)",
]
for t in tests:
    print(strip_all_entities(strip_links(t)))

# Output (hashtags are dropped entirely, so "Macy" is removed as well):
# I really love that shirt at
# Titanic tragedy could have been prevented Economic Times Telegraph co ukTitanic tragedy could have been preve
# I am at Starbucks 7419 3rd ave at 75th Brooklyn
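
The code above discards the whole hashtag, whereas the question's expected output keeps the word ("Macy"). If keeping it is preferred, a regex-only variant could look roughly like the sketch below. The three patterns and the name clean_tweet are assumptions made for illustration and are not part of the answer above; in particular, the @\w+ pattern would also clip e-mail addresses, so treat it as a starting point.

```python
import re

# Hypothetical patterns, chosen for illustration only.
URL_RE = re.compile(r'https?://\S+')   # http(s) link up to the next whitespace
MENTION_RE = re.compile(r'@\w+')       # @user mention
PUNCT_RE = re.compile(r'[^\w\s]')      # anything that is not a word char or whitespace

def clean_tweet(text):
    text = URL_RE.sub(' ', text)       # remove links first, while '.' and '/' are intact
    text = MENTION_RE.sub(' ', text)   # remove @user mentions
    text = PUNCT_RE.sub(' ', text)     # replace punctuation, including '#', with spaces
    return ' '.join(text.split())      # collapse repeated whitespace

print(clean_tweet("@peter I really love that shirt at #Macy. http://bet.ly//WjdiW4"))
print(clean_tweet("I am at Starbucks http://4sh.com/samqUI (7419 3rd ave, at 75th, Brooklyn)"))
```

Links are removed before punctuation on purpose: once the dots and slashes are gone, a URL can no longer be recognised.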
10
xecgr

I know it's not a regex, but:

>>> from urllib.parse import urlparse  # Python 3; on Python 2 use: from urlparse import urlparse
>>> tweet = '@peter I really love that shirt at #Macy. http://bit.ly//WjdiW#'
>>> new_string = ''
>>> for i in tweet.split():
...     s, n, p, pa, q, f = urlparse(i)
...     if s and n:
...         pass  # token has a scheme and a host, so it is a link: skip it
...     elif i[:1] == '@':
...         pass  # @user mention: skip it
...     elif i[:1] == '#':
...         new_string = new_string.strip() + ' ' + i[1:]  # keep the hashtag text without '#'
...     else:
...         new_string = new_string.strip() + ' ' + i
...
>>> new_string
'I really love that shirt at Macy.'
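
The same token-by-token idea can be wrapped in a function that also trims punctuation from the edges of each word, which the question asks for. This is only a sketch: the name clean_tweet_tokens is hypothetical, and punctuation inside a token (as in "Telegraph.co.ukTitanic") is left alone.

```python
import string
from urllib.parse import urlparse

def clean_tweet_tokens(tweet):
    words = []
    for token in tweet.split():
        if token.startswith('@'):
            continue  # skip @user mentions
        try:
            parsed = urlparse(token)
            is_link = bool(parsed.scheme and parsed.netloc)
        except ValueError:
            is_link = False  # unparsable token: treat it as plain text
        if is_link:
            continue  # skip links
        # Trim punctuation from both ends; this also removes a leading '#'.
        word = token.strip(string.punctuation)
        if word:
            words.append(word)
    return ' '.join(words)

print(clean_tweet_tokens('@peter I really love that shirt at #Macy. http://bit.ly//WjdiW#'))
```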
0
Ben